[
https://issues.apache.org/jira/browse/ARROW-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicola Crane updated ARROW-16905:
---------------------------------
Summary: [Python] Table.to_pandas() fails for dictionary encoded columns
with an is_null partition_expression (was: Table.to_pandas() fails for
dictionary encoded columns with an is_null partition_expression)
> [Python] Table.to_pandas() fails for dictionary encoded columns with an
> is_null partition_expression
> ----------------------------------------------------------------------------------------------------
>
> Key: ARROW-16905
> URL: https://issues.apache.org/jira/browse/ARROW-16905
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 8.0.0
> Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
> Reporter: Thomas Newton
> Priority: Major
> Attachments: reproduce_null_dictionary_issue.zip
>
>
> Minimal steps to reproduce:
> I attached a `.zip` file containing a Python script and a test Parquet file.
> Running this Python script reproduces the issue.
> The steps taken to reproduce:
> # Create a test Parquet file with one column containing only nulls.
> # Create a Parquet fragment from this file, adding a `partition_expression`
> with an `is_null` guarantee on this fragment.
> # Create a `FileSystemDataset` from this fragment, setting the schema to be a
> dictionary column.
> # Call `.to_table().to_pandas()` on the resulting PyArrow dataset. You will
> get the following error.
> {code:java}
> File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in
> validate_categories
> raise ValueError("Categorical categories cannot be null")
> ValueError: Categorical categories cannot be null {code}
>
> My understanding of why this doesn't work:
> # There are 2 ways of dictionary encoding nulls: `mask` and `encode`
> described in the [pyarrow
> docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions].
> PyArrow supports both, but pandas categoricals only support `mask`. Arguably
> the real issue here is that pandas should support `encode`-style categoricals.
> # When you provide an `.is_null` guarantee on a fragment, Arrow will not
> actually read the data. It knows the type from the schema, we've guaranteed
> that the values are all null, and it can get the length from the Parquet
> metadata, so it has everything it needs.
> # Instead of reading the data, it uses the [Null
> ArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc].
> For dictionary type columns I believe that calls [this DictionaryArray
> constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93],
> which appears to create the dictionary in the `encode` style.
> Would it be possible to make this configurable? It seems like the `mask`
> style of dictionary encoding is the default for the rest of PyArrow, and it
> would solve the pandas compatibility issue. I appreciate this is probably an
> extremely niche issue, but my options for a workaround are looking pretty
> horrible.
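
To make the `mask` versus `encode` distinction from the report concrete, here is a minimal sketch in plain Python. It models a dictionary-encoded column as an (indices, dictionary) pair of lists and does not use PyArrow or pandas. The function names are hypothetical and chosen for illustration only; this is not how Arrow implements either style.

```python
from typing import List, Optional, Tuple

Encoded = Tuple[List[Optional[int]], List[Optional[str]]]


def encode_mask_style(values: List[Optional[str]]) -> Encoded:
    """`mask` style: a null value becomes a null index; the dictionary has no null."""
    dictionary: List[Optional[str]] = []
    positions = {}
    indices: List[Optional[int]] = []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def encode_encode_style(values: List[Optional[str]]) -> Encoded:
    """`encode` style: null gets its own dictionary entry; every index is valid."""
    dictionary: List[Optional[str]] = []
    positions = {}
    indices: List[Optional[int]] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def to_mask_style(indices: List[Optional[int]],
                  dictionary: List[Optional[str]]) -> Encoded:
    """Rewrite an encoded pair so that no dictionary entry is null."""
    remap = {}
    new_dictionary: List[Optional[str]] = []
    for old_index, value in enumerate(dictionary):
        if value is not None:
            remap[old_index] = len(new_dictionary)
            new_dictionary.append(value)
    # Indices that pointed at a null entry have no remap target and become null.
    new_indices = [None if i is None else remap.get(i) for i in indices]
    return new_indices, new_dictionary


# An all-null column, as in the report: the `encode` style leaves a null
# inside the dictionary, while the `mask` style leaves the dictionary empty.
all_null_encode = encode_encode_style([None, None, None])
all_null_mask = to_mask_style(*all_null_encode)
```

In this model the all-null column in `encode` style carries a dictionary whose only entry is null, which corresponds to the null category that pandas rejects in the traceback above. Converting to the `mask` style leaves an empty dictionary and all-null indices.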
--
This message was sent by Atlassian Jira
(v8.20.7#820007)