[ 
https://issues.apache.org/jira/browse/ARROW-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382271#comment-17382271
 ] 

Weston Pace commented on ARROW-13342:
-------------------------------------

Judging by ARROW-3246, I believe the original intent was to support more types 
than just the binary types.  Also, Arrow is supposed to store custom 
metadata (ARROW:schema) in the parquet file with details of the original 
Arrow schema.  That way, if we need to store the data more efficiently, we 
can do so while still restoring the original types upon deserialization.  I 
checked: store_schema is set to true, and the stored schema is properly decoded 
and restored.  It does report the column "a" as dictionary<values=bool, 
indices=int8, ordered=0>.  I'm not sure whether some linkage is broken or 
whether it only tries to restore the metadata on binary types.

 

However, I think the resolution for this issue (although maybe not easy) is to 
extend the existing mechanism to handle more types.

> [Python] Categorical boolean column saved as regular boolean in parquet
> -----------------------------------------------------------------------
>
>                 Key: ARROW-13342
>                 URL: https://issues.apache.org/jira/browse/ARROW-13342
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 4.0.1
>            Reporter: Joao Moreira
>            Priority: Major
>
> When saving a pandas dataframe to parquet, if there is a categorical column 
> where the categories are boolean, the column is saved as regular boolean.
> This causes an issue because, when reading back the parquet file, I expect 
> the column to still be categorical.
>  
> Reproducible example:
> {code:python}
> import pandas as pd
> import pyarrow
> import pyarrow.parquet  # the parquet submodule must be imported explicitly
> # Create dataframe with boolean column that is then converted to categorical
> df = pd.DataFrame({'a': [True, True, False, True, False]})
> df['a'] = df['a'].astype('category')
> # Convert to arrow Table and save to disk
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, 'test.parquet')
> # Reload data and convert back to pandas
> table_rel = pyarrow.parquet.read_table('test.parquet')
> df_rel = table_rel.to_pandas()
> {code}
> The arrow {{table}} variable correctly converts the column to an arrow 
> {{DICTIONARY}} type:
> {noformat}
> >>> df['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: category
> Categories (2, object): [False, True]
> >>>
> >>> table
> pyarrow.Table
> a: dictionary<values=bool, indices=int8, ordered=0>
> {noformat}
> However, the reloaded column is now a regular boolean:
> {noformat}
> >>> table_rel
> pyarrow.Table
> a: bool
> >>>
> >>> df_rel['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: bool
> {noformat}
> I would have expected the column to be read back as categorical.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
