[jira] [Commented] (ARROW-11157) [Python] Consistent handling of categoricals

Joris Van den Bossche (Jira) Mon, 25 Jan 2021 07:31:08 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271370#comment-17271370
 ]


Joris Van den Bossche commented on ARROW-11157:
-----------------------------------------------

I looked into this a bit. Converting the pandas DataFrame to a pyarrow Table
still preserves the categorical (dictionary) data type:

{code}
In [55]: table = pa.table(df)

In [56]: table
Out[56]: 
pyarrow.Table
int: dictionary<values=int64, indices=int8, ordered=0>
str: dictionary<values=string, indices=int8, ordered=0>
{code}

but when writing to Parquet and reading the file back, we restore the
dictionary type for the string column and not for the int column (as you
already observed):

{code}
In [58]: pq.write_table(table, "test_categoricals.parquet")

In [60]: pq.read_table("test_categoricals.parquet")
Out[60]: 
pyarrow.Table
int: int64
str: dictionary<values=string, indices=int32, ordered=0>
{code}

We store the original Arrow schema in the file metadata so that we can restore
the original types on read (Parquet has no concept of a separate
dictionary/categorical type; it only uses dictionary encoding as a form of
compression). The code where this is done lives here:
https://github.com/apache/arrow/blob/b3b62412140182a0ac7b4f585fac27c8c57f1662/cpp/src/parquet/arrow/schema.cc#L874-L884
 
The {{IsDictionaryReadSupported}} call in the if-check basically tests whether
the data type is binary or string, so we only restore the dictionary type for
those data types and not for integers (and thus, when finally converting to
pandas, you don't see a categorical dtype for the integer column either).
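
To make the effect of that check concrete, here is a deliberately simplified
pure-Python model of the decision described above. It is not the C++
implementation: {{StoredField}}, {{restored_type}} and the string type names
are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class StoredField(NamedTuple):
    """Hypothetical summary of one field of the stored Arrow schema."""
    name: str
    value_type: str       # e.g. "int64" or "string"
    is_dictionary: bool   # was the original Arrow type a dictionary type?


def is_dictionary_read_supported(value_type: str) -> bool:
    # Simplified stand-in for the C++ check: only binary/string qualify.
    return value_type in {"binary", "string"}


def restored_type(field: StoredField) -> str:
    # Restore the dictionary type only when direct dictionary reads are
    # supported for the value type; otherwise keep the plain value type.
    if field.is_dictionary and is_dictionary_read_supported(field.value_type):
        return f"dictionary<values={field.value_type}>"
    return field.value_type


fields = [
    StoredField("int", "int64", True),
    StoredField("str", "string", True),
]
for f in fields:
    print(f.name, "->", restored_type(f))
```

In this model both columns were stored as dictionary types, but only the
string column gets its dictionary type back, which mirrors the
{{pq.read_table}} output shown earlier.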

I am not familiar enough with this part of the codebase to really know the
reason for this, but I suppose it is related to ARROW-6140. For string/binary
values, we can read the Parquet data directly into an Arrow DictionaryArray,
while this is not yet supported for other types. So if we wanted to restore
integer columns as dictionary type as well right now, that would basically
mean dictionary-encoding the already materialized integer column after the
fact (as long as ARROW-6140 is not implemented).

It is also documented that the {{read_dictionary}} keyword is (for this reason) 
only valid for string/bytes columns: 
https://arrow.apache.org/docs/python/parquet.html#reading-types-as-dictionaryarray

Now, the above is about the actual Parquet reading. If we know, based on the
pandas dtype information we store in the metadata, that the original column
was a categorical column, we could also still restore the categorical dtype on
conversion from pyarrow -> pandas (but that's something we currently also
don't do for e.g. string columns).

> [Python] Consistent handling of categoricals
> --------------------------------------------
>
>                 Key: ARROW-11157
>                 URL: https://issues.apache.org/jira/browse/ARROW-11157
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chris Roat
>            Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories` 
> parameter mentioned [in this GitHub 
> issue|https://github.com/apache/arrow/issues/1688] does not seem to be 
> accepted in `pd.read_parquet` anymore. I see that read/write of `int` 
> categoricals does not work, though `str` does -- except if the file is 
> written by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following 
> handling of categoricals:
>  
> {code:python}
> import os
> import pandas as pd
> fname = '/tmp/tst'
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0,1])),
>     'str': pd.Series(['foo', 'bar'] * 1000,
>                      dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>  
>  
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
