Jared Weston created ARROW-17900:
------------------------------------
Summary: [Python] combine_chunks on DictionaryArray appears to be
broken
Key: ARROW-17900
URL: https://issues.apache.org/jira/browse/ARROW-17900
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Jared Weston
Attachments: category_counts.py, test.parquet
I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug
when combining the chunks of a dictionary-encoded column read from a Parquet
file with multiple row groups. The dictionary is a StringArray of categories.
It is worth noting that not every category is present in every chunk. To me,
the issue appears to be that, when the chunks are combined, the category
indices are incorrect for any chunk that is missing a category. I suspect this
because, compared to the working version, the bugged version reports higher
counts for the categories with lower indices (0, 1) and lower counts for the
categories with higher indices (2, 3, 4).
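To illustrate the suspected failure mode, here is a minimal pure-Python sketch.
It is not pyarrow's implementation; the chunk data and the function names
(combine_naive, combine_unified) are hypothetical and only show how counts
shift toward the lower indices when per-chunk indices are concatenated without
being remapped into one unified dictionary.

```python
from collections import Counter

# Hypothetical data: each chunk is (dictionary, indices into that dictionary).
# The second chunk is missing "b", so "c" sits at index 0 there.
chunks = [
    (["a", "b", "c"], [0, 1, 2, 2]),
    (["c", "a"], [0, 0, 1]),
]


def combine_naive(chunks):
    """Concatenate raw indices and decode with the first chunk's dictionary.

    This models the suspected bug: indices from later chunks are interpreted
    against the wrong dictionary.
    """
    dictionary = chunks[0][0]
    indices = [i for _, chunk_indices in chunks for i in chunk_indices]
    return [dictionary[i] for i in indices]


def combine_unified(chunks):
    """Build one unified dictionary and remap each chunk's indices into it."""
    unified = []    # unified dictionary values, in first-seen order
    position = {}   # value -> index in the unified dictionary
    decoded = []
    for dictionary, chunk_indices in chunks:
        remap = []  # chunk-local index -> unified index
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            remap.append(position[value])
        decoded.extend(unified[remap[i]] for i in chunk_indices)
    return decoded


naive_counts = Counter(combine_naive(chunks))
unified_counts = Counter(combine_unified(chunks))
print("naive:  ", dict(naive_counts))
print("unified:", dict(unified_counts))
```

In this toy data the naive combination over-counts "a" and "b" (indices 0 and
1) and under-counts "c" (index 2), which is the same pattern as in the
attached output.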
The difference is easy to see when running a value count. For example:
!two.png!
A workaround for now is to read the column directly as a string array and then
encode it as a dictionary. This is not ideal, however, because of speed and
memory concerns.
!one.png!
Attached are my table (I did not create it, so please excuse the data and the
UUID-style column names) and a script that shows the difference. Please run the
script with pyarrow 4.0.1 and pyarrow 9.0.0 to compare the output.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)