[ https://issues.apache.org/jira/browse/ARROW-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382081#comment-17382081 ]

Alessandro Molina edited comment on ARROW-13342 at 7/16/21, 1:51 PM:
---------------------------------------------------------------------

FYI, this seems to happen for numeric types as well; the same is true for {{int}} and {{double}}:

{code}
import pyarrow as pa
import pyarrow.parquet

# Build a dictionary-encoded int64 column and save it to disk as Parquet
table = pa.Table.from_arrays([
    pa.DictionaryArray.from_arrays(
        indices=pa.array([0, 1, 1, 0, 1, 1, 0], type=pa.int8()),
        dictionary=[10000000000000000, 20000000000000000])
], names=["data"])
print(table)
pa.parquet.write_table(table, 'test.parquet')

# Reload the data: the column comes back as plain int64, not dictionary
table_rel = pa.parquet.read_table('test.parquet')
print(table_rel)
{code}

The exception seems to be {{unicode}}, which preserves the dictionary form.

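For reference, here is a minimal sketch of the string case, which does round-trip in dictionary form (the {{test_str.parquet}} file name is just a placeholder for this sketch):

{code}
import pyarrow as pa
import pyarrow.parquet

# A string (unicode) dictionary column keeps its dictionary type on read-back
table = pa.Table.from_arrays([
    pa.DictionaryArray.from_arrays(
        indices=pa.array([0, 1, 1, 0], type=pa.int8()),
        dictionary=["foo", "bar"])
], names=["data"])
pa.parquet.write_table(table, 'test_str.parquet')

# Expected: data: dictionary<values=string, indices=int32, ordered=0>
# (the index type may be widened on read-back)
print(pa.parquet.read_table('test_str.parquet'))
{code}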
From my understanding, Arrow currently writes only binary types in dictionary form (see https://github.com/apache/arrow/blob/14b75ee71d770ba86999e0e7a0e0b94629b91968/cpp/src/parquet/column_writer.cc#L1008 ).

This seems fairly reasonable behaviour to me, because binary data and text are where dictionary encoding saves the most, thanks to the size difference between the values and the indices.

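Relatedly, {{read_table}} exposes a {{read_dictionary}} argument to request dictionary form at read time, but it is documented as supported only for BYTE_ARRAY (binary/string) storage, which matches the writer behaviour linked above:

{code}
import pyarrow as pa
import pyarrow.parquet

# Explicitly ask for the column back as a DictionaryArray; this only works
# for binary/string columns ('test_str.parquet' is from the sketch above)
table_rel = pa.parquet.read_table('test_str.parquet', read_dictionary=['data'])
print(table_rel)
{code}
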
On the other hand, I don't think that preserving the exact type provided to a writer is an expectation that can always be satisfied, or that it makes sense to enforce. In some cases, forcing dictionary encoding just because the input was dictionary encoded might produce bigger Parquet files, and for some formats it might not even be possible: think of CSV or JSON, where not all types can be represented, so the data you read back might have a different type.
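
A possible caller-side workaround, sketched here rather than something the writer guarantees, is to re-encode after reading with {{pyarrow.compute.dictionary_encode}}:

{code}
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet

# Re-encode the plain int64 column back into dictionary form after reading
# ('test.parquet' is the file written in the first snippet)
table_rel = pa.parquet.read_table('test.parquet')
encoded = table_rel.set_column(
    0, "data", pc.dictionary_encode(table_rel.column("data")))

# Expected: data: dictionary<values=int64, indices=int32, ordered=0>
print(encoded)
{code}

Note the resulting indices will be whatever {{dictionary_encode}} produces (int32 by default), not necessarily the original int8.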




> [Python] Categorical boolean column saved as regular boolean in parquet
> -----------------------------------------------------------------------
>
>                 Key: ARROW-13342
>                 URL: https://issues.apache.org/jira/browse/ARROW-13342
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 4.0.1
>            Reporter: Joao Moreira
>            Priority: Major
>
> When saving a pandas dataframe to parquet, if there is a categorical column 
> where the categories are boolean, the column is saved as regular boolean.
> This causes an issue because, when reading back the parquet file, I expect 
> the column to still be categorical.
>  
> Reproducible example:
> {code:python}
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> # Create dataframe with boolean column that is then converted to categorical
> df = pd.DataFrame({'a': [True, True, False, True, False]})
> df['a'] = df['a'].astype('category')
> # Convert to arrow Table and save to disk
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, 'test.parquet')
> # Reload data and convert back to pandas
> table_rel = pyarrow.parquet.read_table('test.parquet')
> df_rel = table_rel.to_pandas()
> {code}
> The arrow {{table}} variable correctly converts the column to an arrow 
> {{DICTIONARY}} type:
> {noformat}
> >>> df['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: category
> Categories (2, object): [False, True]
> >>>
> >>> table
> pyarrow.Table
> a: dictionary<values=bool, indices=int8, ordered=0>
> {noformat}
> However, the reloaded column is now a regular boolean:
> {noformat}
> >>> table_rel
> pyarrow.Table
> a: bool
> >>>
> >>> df_rel['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: bool
> {noformat}
> I would have expected the column to be read back as categorical.


