Thomas Newton created ARROW-16905:
-------------------------------------
Summary: Table.to_pandas() fails for dictionary encoded columns
with an is_null partition_expression
Key: ARROW-16905
URL: https://issues.apache.org/jira/browse/ARROW-16905
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 8.0.0
Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
Reporter: Thomas Newton
Attachments: reproduce_null_dictionary_issue.zip
Minimal steps to reproduce:
I attached a `.zip` file containing a python script and a test parquet file.
Running this python script reproduces the issue.
The steps taken to reproduce (a rough sketch of the attached script follows the error below):
# Create a test parquet file with one column containing only nulls.
# Create a parquet fragment from this file, adding a `partition_expression`
with an `is_null` guarantee for the column.
# Create a `FileSystemDataset` from this fragment, setting the schema so the
column is a dictionary type.
# Call `.to_table().to_pandas()` on the resulting pyarrow dataset. You will
get the following error.
{code:python}
File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in
validate_categories
raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null {code}
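To make the steps concrete, this is roughly what the attached script does (the file path, column name and exact types below are illustrative; the real ones are in the attachment):
{code:python}
# Rough sketch of reproduce_null_dictionary_issue.zip; names and types are illustrative.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

# 1. A parquet file with one column containing only nulls.
pq.write_table(
    pa.table({"col": pa.array([None] * 3, type=pa.string())}),
    "/tmp/nulls.parquet",
)

# 2. A fragment whose partition expression carries an is_null guarantee.
fmt = ds.ParquetFileFormat()
fragment = fmt.make_fragment(
    "/tmp/nulls.parquet",
    filesystem=fs.LocalFileSystem(),
    partition_expression=ds.field("col").is_null(),
)

# 3. A FileSystemDataset whose schema declares the column as a dictionary type.
schema = pa.schema([pa.field("col", pa.dictionary(pa.int32(), pa.string()))])
dataset = ds.FileSystemDataset([fragment], schema, fmt)

# 4. Converting to pandas raises:
#    ValueError: Categorical categories cannot be null
dataset.to_table().to_pandas()
{code}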
My understanding of why this doesn't work:
# There are 2 ways of dictionary-encoding nulls, `mask` and `encode`, described
in the [pyarrow
docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions].
PyArrow supports both, but pandas categoricals only support `mask` (see the
small example after this list). Arguably the real issue here is that pandas
should support `encode`-style categoricals.
# When you provide an `is_null` guarantee on a fragment, Arrow will not
actually read the data. It knows the type from the schema, the guarantee says
the values are all null, and it can get the length from the parquet metadata,
so it has everything it needs.
# Instead of reading the data, it uses the
[NullArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc].
For dictionary-type columns I believe that calls [this DictionaryArray
constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93],
which appears to create the dictionary in the `encode` style.
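To illustrate the `mask` vs `encode` distinction outside the dataset machinery, here is a small hand-built example (not from the attachment) of why only the `mask` style survives the pandas conversion:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", None, "a"])

# "mask" style (the default, i.e. DictionaryEncodeOptions(null_encoding="mask")):
# the null stays as a null index and the dictionary itself has no null entry.
masked = pc.dictionary_encode(arr)
print(masked.dictionary)   # ["a"]
masked.to_pandas()         # works; the null becomes NaN in the Categorical

# "encode" style: the null is stored as an entry of the dictionary itself.
encoded = pa.DictionaryArray.from_arrays(
    indices=pa.array([0, 1, 0], type=pa.int32()),
    dictionary=pa.array(["a", None]),
)
print(encoded.dictionary)  # ["a", null]
encoded.to_pandas()        # raises ValueError: Categorical categories cannot be null
{code}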
Would it be possible to make this configurable? The `mask` style of dictionary
encoding seems to be the default for the rest of PyArrow, and it would solve
the pandas compatibility issue. I appreciate this is probably an extremely
niche issue, but my options for a workaround are looking pretty horrible.