[ 
https://issues.apache.org/jira/browse/ARROW-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-16905:
---------------------------------
    Summary: [Python] Table.to_pandas() fails for dictionary encoded columns 
with an is_null partition_expression  (was: Table.to_pandas() fails for 
dictionary encoded columns with an is_null partition_expression)

> [Python] Table.to_pandas() fails for dictionary encoded columns with an 
> is_null partition_expression
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16905
>                 URL: https://issues.apache.org/jira/browse/ARROW-16905
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 8.0.0
>         Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
>            Reporter: Thomas Newton
>            Priority: Major
>         Attachments: reproduce_null_dictionary_issue.zip
>
>
> Minimal steps to reproduce:
> I attached a `.zip` file containing a Python script and a test Parquet file. 
> Running the script reproduces the issue.
> The steps taken to reproduce:
>  # Create a test Parquet file with one column containing only nulls.
>  # Create a Parquet fragment from this file, adding a `partition_expression` 
> with an `is_null` guarantee on this fragment.
>  # Create a `FileSystemDataset` from this fragment, with a schema in which 
> that column has a dictionary type.
>  # Call `.to_table().to_pandas()` on the resulting pyarrow dataset. This 
> raises the following error:
> {code:java}
>   File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in validate_categories
>     raise ValueError("Categorical categories cannot be null")
> ValueError: Categorical categories cannot be null
> {code}
>  
> My understanding of why this doesn't work:
>  # There are two ways of dictionary-encoding nulls, `mask` and `encode`, 
> described in the [pyarrow 
> docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions].
>  PyArrow supports both, but pandas categoricals only support `mask`. Arguably 
> the real issue here is that pandas should support `encode`-style categoricals.
>  # When you provide an `is_null` guarantee on a fragment, Arrow does not 
> actually read the data. It knows the type from the schema, the guarantee 
> says the values are all null, and it can get the length from the Parquet 
> metadata, so it has everything it needs.
>  # Instead of reading the data, it uses the 
> [NullArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc].
>  For dictionary-type columns I believe that calls [this DictionaryArray 
> constructor|https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93], 
> which appears to create the dictionary in the `encode` style.
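
To make the difference between the two null encodings concrete, here is a library-free sketch in plain Python. It does not call Arrow or pandas. The helper names `encode_mask_style` and `encode_encode_style` are made up for illustration, and the all-null result only mirrors what the report believes the null array factory produces.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Encoded = Tuple[List[Optional[Hashable]], List[Optional[int]]]


def encode_mask_style(values: Sequence[Optional[Hashable]]) -> Encoded:
    """Mask style: nulls never enter the dictionary; their index is None."""
    dictionary: List[Optional[Hashable]] = []
    positions: Dict[Hashable, int] = {}
    indices: List[Optional[int]] = []
    for value in values:
        if value is None:
            indices.append(None)  # the null lives in the indices
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def encode_encode_style(values: Sequence[Optional[Hashable]]) -> Encoded:
    """Encode style: null is stored as an ordinary dictionary entry."""
    dictionary: List[Optional[Hashable]] = []
    positions: Dict[Optional[Hashable], int] = {}
    indices: List[Optional[int]] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)  # may append None
        indices.append(positions[value])
    return dictionary, indices


all_null = [None, None, None]
mask_dictionary, mask_indices = encode_mask_style(all_null)
encode_dictionary, encode_indices = encode_encode_style(all_null)
```

For an all-null column, the mask style yields an empty dictionary with null indices, while the encode style yields a dictionary whose only entry is null. The second shape is the one that trips the "Categorical categories cannot be null" check quoted above.
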
> Would it be possible to make this configurable? The `mask` style of 
> dictionary encoding seems to be the default in the rest of PyArrow, and using 
> it here would solve the pandas compatibility issue. I appreciate this is 
> probably an extremely niche issue, but my options for a workaround are 
> looking pretty horrible. 
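
One way to picture what a `mask`-style result would look like: drop the null entries from the dictionary and turn every index that pointed at them into a null index. The sketch below does this on plain Python lists. `to_mask_style` is a hypothetical helper, not an Arrow or pandas API, and it says nothing about how Arrow would implement such an option internally.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def to_mask_style(
    dictionary: Sequence[Optional[Hashable]],
    indices: Sequence[Optional[int]],
) -> Tuple[List[Hashable], List[Optional[int]]]:
    """Rewrite an encode-style (dictionary, indices) pair into mask style."""
    remap: Dict[int, int] = {}  # old dictionary position -> new position
    new_dictionary: List[Hashable] = []
    for old_position, value in enumerate(dictionary):
        if value is not None:
            remap[old_position] = len(new_dictionary)
            new_dictionary.append(value)
    # Indices that pointed at a null entry have no remap target and become None.
    new_indices = [None if i is None else remap.get(i) for i in indices]
    return new_dictionary, new_indices


# The all-null case from the report: a dictionary holding a single null.
fixed_dictionary, fixed_indices = to_mask_style([None], [0, 0, 0])
```

After the rewrite, the dictionary contains no nulls, so it would be a valid set of categories for a pandas categorical, and the nulls are carried by the indices instead.
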



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
