[
https://issues.apache.org/jira/browse/ARROW-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382497#comment-17382497
]
Joris Van den Bossche commented on ARROW-13342:
-----------------------------------------------
Reading dictionary-encoded Parquet data directly into dictionary-typed Arrow
arrays is indeed currently supported only for BYTE_ARRAY storage (i.e. string
and binary columns). The follow-up issue for extending this support to other
data types is ARROW-6140.
The documentation of {{pq.read_table}} mentions this limitation, if only briefly:
{code}
read_dictionary : list, default None
List of names or column paths (for nested types) to read directly
as DictionaryArray. Only supported for BYTE_ARRAY storage.
...
{code}
This parameter lets you explicitly specify which columns to read as dictionary
arrays. However, the docstring does not describe the default behaviour, which
is to infer those columns from the Arrow schema stored in the file metadata
({{ARROW:schema}}).
> [Python] Categorical boolean column saved as regular boolean in parquet
> -----------------------------------------------------------------------
>
> Key: ARROW-13342
> URL: https://issues.apache.org/jira/browse/ARROW-13342
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 4.0.1
> Reporter: Joao Moreira
> Priority: Major
>
> When saving a pandas dataframe to parquet, if there is a categorical column
> where the categories are boolean, the column is saved as regular boolean.
> This causes an issue because, when reading back the parquet file, I expect
> the column to still be categorical.
>
> Reproducible example:
> {code:python}
> import pandas as pd
> import pyarrow
> import pyarrow.parquet  # the parquet submodule must be imported explicitly
> # Create dataframe with boolean column that is then converted to categorical
> df = pd.DataFrame({'a': [True, True, False, True, False]})
> df['a'] = df['a'].astype('category')
> # Convert to arrow Table and save to disk
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, 'test.parquet')
> # Reload data and convert back to pandas
> table_rel = pyarrow.parquet.read_table('test.parquet')
> df_rel = table_rel.to_pandas()
> {code}
> In the Arrow {{table}}, the column is correctly converted to a
> {{DICTIONARY}} type:
> {noformat}
> >>> df['a']
> 0 True
> 1 True
> 2 False
> 3 True
> 4 False
> Name: a, dtype: category
> Categories (2, object): [False, True]
> >>>
> >>> table
> pyarrow.Table
> a: dictionary<values=bool, indices=int8, ordered=0>
> {noformat}
> However, the reloaded column is now a regular boolean:
> {noformat}
> >>> table_rel
> pyarrow.Table
> a: bool
> >>>
> >>> df_rel['a']
> 0 True
> 1 True
> 2 False
> 3 True
> 4 False
> Name: a, dtype: bool
> {noformat}
> I would have expected the column to be read back as categorical.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)