[jira] [Commented] (ARROW-11157) [Python] Consistent handling of categoricals

Joris Van den Bossche (Jira) Mon, 25 Jan 2021 07:38:36 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271372#comment-17271372
 ]


Joris Van den Bossche commented on ARROW-11157:
-----------------------------------------------

Additional answer specifically about this:

bq.  The `categories` parameter mentioned in this GitHub issue does not seem to 
be accepted in `pd.read_parquet` anymore

It certainly still works in the {{to_pandas}} method:

{code}
In [89]: pa.table({'int': [1, 2]}).to_pandas().dtypes
Out[89]: 
int    int64
dtype: object

In [90]: pa.table({'int': [1, 2]}).to_pandas(categories=['int']).dtypes
Out[90]: 
int    category
dtype: object
{code}

It might be that this worked at some point directly in the {{pd.read_parquet}} 
function, but it indeed does not at the moment. The main problem here is that 
additional keywords ({{kwargs}}) could be meant either for 
{{parquet.read_table}} or for {{Table.to_pandas}}, and right now they are all 
passed to {{read_table}}. We should probably have a way to specify keywords 
for {{to_pandas}} as well. 

As a workaround, you can read with pyarrow and do the conversion to pandas 
manually. So basically, instead of {{pd.read_parquet(..)}}, you can do 
{{pyarrow.parquet.read_table(..).to_pandas(..)}}.

> [Python] Consistent handling of categoricals
> --------------------------------------------
>
>                 Key: ARROW-11157
>                 URL: https://issues.apache.org/jira/browse/ARROW-11157
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chris Roat
>            Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories` 
> parameter mentioned [in this 
> GitHub|https://github.com/apache/arrow/issues/1688] issue does not seem to be 
> accepted in `pd.read_parquet` anymore. I see that read/write of `int` 
> categoricals does not work, though `str` do -- except if the file is written 
> by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following 
> handling of categoricals:
>  
> {code:python}
> import os
> import pandas as pd
> fname = '/tmp/tst'
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0,1])),
>     'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>  
>  
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
