Thomas Newton created ARROW-16905:
-------------------------------------
Summary: Table.to_pandas() fails for dictionary encoded columns
with an is_null partition_expression
Key: ARROW-16905
URL: https://issues.apache.org/jira/browse/ARROW-16905
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 8.0.0
Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
Reporter: Thomas Newton
Attachments: reproduce_null_dictionary_issue.zip
Minimal steps to reproduce:
I attached a `.zip` file containing a python script and a test parquet file.
Running this python script reproduces the issue.
The steps taken to reproduce (a rough sketch of the attached script follows the error below):
# Create a test parquet file with one column containing only nulls.
# Create a parquet fragment from this file, adding a `partition_expression`
with an `is_null` guarantee for the column.
# Create a `FileSystemDataset` from this fragment, setting the schema so the
column is a dictionary type.
# Call `.to_table().to_pandas()` on the resulting pyarrow dataset. You will
get the following error.
{code:python}
File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in
validate_categories
raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null {code}
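To make the steps concrete, this is roughly what the attached script does (the file path, column name and exact types below are illustrative; the real ones are in the attachment):
{code:python}
# Rough sketch of reproduce_null_dictionary_issue.zip; names and types are illustrative.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

# 1. A parquet file with one column containing only nulls.
pq.write_table(
    pa.table({"col": pa.array([None] * 3, type=pa.string())}),
    "/tmp/nulls.parquet",
)

# 2. A fragment whose partition expression carries an is_null guarantee.
fmt = ds.ParquetFileFormat()
fragment = fmt.make_fragment(
    "/tmp/nulls.parquet",
    filesystem=fs.LocalFileSystem(),
    partition_expression=ds.field("col").is_null(),
)

# 3. A FileSystemDataset whose schema declares the column as a dictionary type.
schema = pa.schema([pa.field("col", pa.dictionary(pa.int32(), pa.string()))])
dataset = ds.FileSystemDataset([fragment], schema, fmt)

# 4. Converting to pandas raises:
#    ValueError: Categorical categories cannot be null
dataset.to_table().to_pandas()
{code}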
My understanding of why this doesn't work:
# There are 2 ways of dictionary-encoding nulls, `mask` and `encode`, described
in the [pyarrow
docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions].
PyArrow supports both, but pandas categoricals only support `mask` (see the
small example after this list). Arguably the real issue here is that pandas
should support `encode`-style categoricals.
# When you provide an `is_null` guarantee on a fragment, Arrow will not
actually read the data. It knows the type from the schema, the guarantee says
the values are all null, and it can get the length from the parquet metadata,
so it has everything it needs.
# Instead of reading the data, it uses the
[NullArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc].
For dictionary-type columns I believe that calls [this DictionaryArray
constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93],
which appears to create the dictionary in the `encode` style.
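To illustrate the `mask` vs `encode` distinction outside the dataset machinery, here is a small hand-built example (not from the attachment) of why only the `mask` style survives the pandas conversion:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", None, "a"])

# "mask" style (the default, i.e. DictionaryEncodeOptions(null_encoding="mask")):
# the null stays as a null index and the dictionary itself has no null entry.
masked = pc.dictionary_encode(arr)
print(masked.dictionary)   # ["a"]
masked.to_pandas()         # works; the null becomes NaN in the Categorical

# "encode" style: the null is stored as an entry of the dictionary itself.
encoded = pa.DictionaryArray.from_arrays(
    indices=pa.array([0, 1, 0], type=pa.int32()),
    dictionary=pa.array(["a", None]),
)
print(encoded.dictionary)  # ["a", null]
encoded.to_pandas()        # raises ValueError: Categorical categories cannot be null
{code}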
Would it be possible to make this configurable? The `mask` style of dictionary
encoding seems to be the default for the rest of PyArrow, and it would solve
the pandas compatibility issue. I appreciate this is probably an extremely
niche issue, but my options for a workaround are looking pretty horrible.