chrisroat commented on issue #1688:
URL: https://github.com/apache/arrow/issues/1688#issuecomment-751128963


   What is the current state of categoricals with pyarrow? The `categories` 
parameter mentioned above no longer seems to be accepted by `pd.read_parquet`. 
Round-tripping `int` categoricals through pyarrow does not preserve the dtype, 
while `str` categoricals do round-trip, unless the file was written by 
fastparquet, in which case pyarrow restores neither.
   
   Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the 
following handling of categoricals:
   
   ```
   import os
   import pandas as pd
   
   fname = '/tmp/tst'
   
   data = {
       'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0,1])),
       'str': pd.Series(['foo', 'bar'] * 1000,
                        dtype=pd.CategoricalDtype(['foo', 'bar'])),
   }
   df = pd.DataFrame(data)
   
   for write in ['fastparquet', 'pyarrow']:
       for read in ['fastparquet', 'pyarrow']:
           if os.path.exists(fname):
               os.remove(fname)
           df.to_parquet(fname, engine=write, compression=None)
           df_read = pd.read_parquet(fname, engine=read)
   
           print()
           print('write:', write, 'read:', read)
           for t in data.keys():
               print(t, df[t].dtype == df_read[t].dtype)
   ```
   
   ```
   write: fastparquet read: fastparquet
   int True
   str True
   
   write: fastparquet read: pyarrow
   int False
   str False
   
   write: pyarrow read: fastparquet
   int True
   str True
   
   write: pyarrow read: pyarrow
   int False
   str True
   ```
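
Until this is sorted out, one possible workaround is to cast the columns back by hand after reading. This is only a sketch: it assumes the original dtypes are still available on the reading side, and `restore_categoricals` is a hypothetical helper name, not part of pandas or pyarrow. The lossy round trip is simulated with `astype` so the snippet runs without any parquet engine.

```python
import pandas as pd


def restore_categoricals(df_read, dtypes):
    """Cast columns of df_read back to the categorical dtypes listed in dtypes.

    Hypothetical helper: dtypes maps column name -> dtype, e.g. the
    `.dtypes` Series of the frame that was originally written.
    """
    out = df_read.copy()
    for col, dtype in dtypes.items():
        # Only touch columns that were categorical before writing.
        if isinstance(dtype, pd.CategoricalDtype) and col in out.columns:
            out[col] = out[col].astype(dtype)
    return out


original = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3,
                     dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for a read that dropped the categorical dtypes.
lossy = original.astype({'int': 'int64', 'str': 'object'})

restored = restore_categoricals(lossy, original.dtypes)
for col in original.columns:
    print(col, restored[col].dtype == original[col].dtype)
```

This does not fix the underlying metadata handling; it only puts the dtypes back once the data is already in memory.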



