[jira] [Commented] (ARROW-11157) [Python] Consistent handling of categoricals

A (Jira) Sun, 21 Aug 2022 19:27:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582693#comment-17582693
 ]


A commented on ARROW-11157:
---------------------------

I need the following to work:

Given a Parquet file, load it automatically and convert it to a pandas 
DataFrame with the correct categorical columns (which XGBoost, for example, 
requires).

The suggested workarounds do not work, because this has to be done 
programmatically, without prior knowledge of any schema, based on the Parquet 
file alone.
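
To make the requirement concrete, here is a minimal sketch of the behaviour 
being asked for. It assumes the writer stored pandas metadata under the 
`pandas` key of the Parquet key-value metadata and marked categorical columns 
with `pandas_type` set to `categorical`; files written without that metadata 
would come back unchanged. The helper names `categorical_columns` and 
`read_parquet_with_categoricals` are hypothetical, not part of pyarrow or 
pandas.

```python
import json


def categorical_columns(pandas_metadata):
    """Return names of columns the writer recorded as pandas categoricals.

    `pandas_metadata` is the decoded JSON stored under the b"pandas" key of
    the Parquet key-value metadata (None or empty if the writer omitted it).
    """
    if not pandas_metadata:
        return []
    return [
        col["name"]
        for col in pandas_metadata.get("columns", [])
        if col.get("pandas_type") == "categorical" and col.get("name") is not None
    ]


def read_parquet_with_categoricals(path):
    """Hypothetical helper: load `path` and restore categorical dtypes."""
    # Imported here so categorical_columns() stays usable without pyarrow.
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    raw = (table.schema.metadata or {}).get(b"pandas")
    meta = json.loads(raw) if raw else None
    df = table.to_pandas()
    for name in categorical_columns(meta):
        if name in df.columns and df[name].dtype.name != "category":
            # Categories are inferred from the values present, so the original
            # category order and any unused categories are not recovered.
            df[name] = df[name].astype("category")
    return df
```

This only needs the file itself, but it is a workaround rather than a fix: 
the resulting categories are rebuilt from the data, so they may differ from 
the `CategoricalDtype` that was originally written.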

> [Python] Consistent handling of categoricals
> --------------------------------------------
>
>                 Key: ARROW-11157
>                 URL: https://issues.apache.org/jira/browse/ARROW-11157
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chris Roat
>            Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories` 
> parameter mentioned in [this GitHub 
> issue|https://github.com/apache/arrow/issues/1688] does not seem to be 
> accepted by `pd.read_parquet` anymore. I see that reading and writing `int` 
> categoricals does not work, though `str` categoricals do, except when the 
> file is written by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following 
> handling of categoricals:
>  
> {code:python}
> import os
> import pandas as pd
>
> fname = '/tmp/tst'
> # One integer-valued and one string-valued categorical column.
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
>     'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> # Round-trip the frame through every writer/reader engine combination.
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             # True if the categorical dtype survived the round trip.
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>  
>  
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
