[
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271370#comment-17271370
]
Joris Van den Bossche commented on ARROW-11157:
-----------------------------------------------
I looked into this a bit. Converting the pandas DataFrame to a pyarrow Table
still preserves the categorical (dictionary) data type:
{code}
In [55]: table = pa.table(df)
In [56]: table
Out[56]:
pyarrow.Table
int: dictionary<values=int64, indices=int8, ordered=0>
str: dictionary<values=string, indices=int8, ordered=0>
{code}
However, when reading the file back from Parquet, we restore the dictionary
type for the string column but not for the int column (as you already
observed):
{code}
In [58]: pq.write_table(table, "test_categoricals.parquet")
In [60]: pq.read_table("test_categoricals.parquet")
Out[60]:
pyarrow.Table
int: int64
str: dictionary<values=string, indices=int32, ordered=0>
{code}
We store the original Arrow schema in the file metadata, so in principle we can
restore the original types. This is needed because Parquet has no separate
dictionary/categorical type; it only uses dictionary encoding as a storage
optimization. The code where the types are restored lives here:
https://github.com/apache/arrow/blob/b3b62412140182a0ac7b4f585fac27c8c57f1662/cpp/src/parquet/arrow/schema.cc#L874-L884
The {{IsDictionaryReadSupported}} call in the if-check basically tests whether
the data type is binary or string. So we only restore the dictionary type for
those data types and not for integers, and thus when finally converting to
pandas, you don't see a categorical dtype for the integer column either.
I am not familiar enough with this part of the codebase to know the exact
reason for this, but I suppose it is related to ARROW-6140. For string/binary
values, we can read the Parquet data directly into an Arrow DictionaryArray,
while this is not yet supported for other types. So as long as ARROW-6140 is
not implemented, restoring the integer columns as dictionary type would
basically mean first materializing the integer column and then
dictionary-encoding it after the fact.
It is also documented that the {{read_dictionary}} keyword is (for this reason)
only valid for string/bytes columns:
https://arrow.apache.org/docs/python/parquet.html#reading-types-as-dictionaryarray
Now, the above is about the actual Parquet reading. If we know, based on the
pandas dtype information we store in the metadata, that the original column
was a categorical column, we could still restore the categorical dtype in the
conversion from pyarrow -> pandas (but that is something we currently don't do
either, e.g. for string columns).
> [Python] Consistent handling of categoricals
> --------------------------------------------
>
> Key: ARROW-11157
> URL: https://issues.apache.org/jira/browse/ARROW-11157
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Chris Roat
> Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories`
> parameter mentioned [in this
> GitHub|https://github.com/apache/arrow/issues/1688] issue does not seem to be
> accepted in `pd.read_parquet` anymore. I see that read/write of `int`
> categoricals does not work, though `str` do -- except if the file is written
> by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following
> handling of categoricals:
>
> {code:java}
> import os
> import pandas as pd
>
> fname = '/tmp/tst'
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
>     'str': pd.Series(['foo', 'bar'] * 1000,
>                      dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>
>
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}