Thomas Newton created ARROW-16905:
-------------------------------------

             Summary: Table.to_pandas() fails for dictionary encoded columns 
with an is_null partition_expression
                 Key: ARROW-16905
                 URL: https://issues.apache.org/jira/browse/ARROW-16905
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 8.0.0
         Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
            Reporter: Thomas Newton
         Attachments: reproduce_null_dictionary_issue.zip

Minimal steps to reproduce:

I attached a `.zip` file containing a Python script and a test Parquet file. 
Running the script reproduces the issue.

The steps taken to reproduce:
 # Create a test Parquet file with one column containing only nulls.
 # Create a Parquet fragment from this file, adding a `partition_expression` 
with an `is_null` guarantee on this fragment.
 # Create a `FileSystemDataset` from this fragment, setting the schema so that 
the column is a dictionary column.
 # Call `.to_table().to_pandas()` on the resulting pyarrow dataset. This raises 
the following error.

{code:java}
  File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null
{code}
 

My understanding of why this doesn't work:
 # There are two ways of dictionary-encoding nulls, `mask` and `encode`, described 
in the [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions].
 PyArrow supports both, but pandas categoricals only support `mask`. Arguably the 
real issue here is that pandas should support `encode`-style categoricals.
 # When you provide an `is_null` guarantee on a fragment, Arrow will not 
actually read the data. It knows the type from the schema, we have guaranteed 
that the values are all null, and it can get the length from the Parquet 
metadata, so it has everything it needs.
 # Instead of reading the data, it uses the [Null 
ArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc].
 For dictionary-type columns I believe that calls [this DictionaryArray 
constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93],
 which appears to create the dictionary in the `encode` style.

Would it be possible to make this configurable? The `mask` style of 
dictionary encoding seems to be the default in the rest of PyArrow, and using 
it here would solve the pandas compatibility issue. I appreciate this is 
probably an extremely niche issue, but my options for a workaround are looking 
pretty horrible.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
