[ 
https://issues.apache.org/jira/browse/ARROW-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jared Weston updated ARROW-17900:
---------------------------------
    Description: 
I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug 
when combining the chunks of a dictionary column read from a Parquet file with 
multiple row groups. The dictionary is a string array of categories.

It is worth noting that not every category is present in every chunk. To me, 
the issue appears to be that, once the chunks are combined, the category 
indices are incorrect for chunks that are missing a category. I assume this 
because the counts for the categories with lower indices (0, 1) are higher in 
the bugged version than in the working version, and the counts for the higher 
indices (2, 3, 4) are lower.
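
To illustrate the suspicion above, here is a minimal pure-Python sketch of how 
reusing one chunk's dictionary for every chunk would skew the counts in this 
way. It is only a model of the hypothesis: the chunk layout and the helper 
names (`combine_naive`, `combine_unified`, `value_counts`) are hypothetical and 
do not reflect pyarrow's actual implementation or a confirmed root cause.

```python
from collections import Counter

# Two hypothetical chunks, each with its own local dictionary.
# The second chunk lacks category "b", so its local indices are shifted.
chunks = [
    {"dictionary": ["a", "b", "c"], "indices": [0, 1, 2, 1]},
    {"dictionary": ["a", "c"], "indices": [0, 1, 1]},
]


def combine_naive(chunks):
    """Concatenate indices and reuse the first chunk's dictionary (incorrect)."""
    dictionary = chunks[0]["dictionary"]
    indices = [i for chunk in chunks for i in chunk["indices"]]
    return dictionary, indices


def combine_unified(chunks):
    """Build one shared dictionary and remap every chunk's indices into it."""
    dictionary, position, indices = [], {}, []
    for chunk in chunks:
        for i in chunk["indices"]:
            value = chunk["dictionary"][i]
            if value not in position:
                position[value] = len(dictionary)
                dictionary.append(value)
            indices.append(position[value])
    return dictionary, indices


def value_counts(dictionary, indices):
    """Count occurrences of each decoded category."""
    return Counter(dictionary[i] for i in indices)


naive_counts = value_counts(*combine_naive(chunks))
unified_counts = value_counts(*combine_unified(chunks))
print("naive:  ", dict(naive_counts))
print("unified:", dict(unified_counts))
```

In this toy data the naive path inflates "b" (index 1) and deflates "c" 
(index 2), while the remapped path reproduces the true counts.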

 

The difference is easy to see when running a value count. For example:

!two.png!

A workaround for now is to read the column directly as a string array and then 
encode it as a dictionary. This isn't ideal, however, due to speed and memory 
concerns.

!one.png!

 

Attached are my Parquet file (test.parquet) and a simple Python script that 
shows the difference (category_counts.py). I did not create this Parquet file; 
I am consuming it from a service, so please excuse the data / UUID-style column 
names. Please run the script with pyarrow 4.0.1 and pyarrow 9.0.0 to see the 
difference in output.

 

  was:
I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug 
when combining the chunks of a dictionary column read from a Parquet file with 
multiple row groups. The dictionary is a string array of categories.

It is worth noting that not every category is present in every chunk. To me, 
the issue appears to be that, once the chunks are combined, the category 
indices are incorrect for chunks that are missing a category. I assume this 
because the counts for the categories with lower indices (0, 1) are higher in 
the bugged version than in the working version, and the counts for the higher 
indices (2, 3, 4) are lower.

 

The difference is easy to see when running a value count. For example:

!two.png!

A workaround for now is to read the column directly as a string array and then 
encode it as a dictionary. This isn't ideal, however, due to speed and memory 
concerns.

!one.png!

 

Attached are my Parquet file (test.parquet) and a simple Python script that 
shows the difference (category_counts.py). (I did not create this Parquet file; 
I am consuming it from a service, so please excuse the data / UUID-style column 
names.) Please run the script with pyarrow 4.0.1 and pyarrow 9.0.0 to see the 
difference in output.

 


> [Python] combine_chunks on DictionaryArray appears to be broken
> ---------------------------------------------------------------
>
>                 Key: ARROW-17900
>                 URL: https://issues.apache.org/jira/browse/ARROW-17900
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jared Weston
>            Priority: Minor
>         Attachments: category_counts.py, one.png, test.parquet, two.png
>
>
> I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a 
> bug when combining the chunks of a dictionary column read from a Parquet file 
> with multiple row groups. The dictionary is a string array of categories.
> It is worth noting that not every category is present in every chunk. To me, 
> the issue appears to be that, once the chunks are combined, the category 
> indices are incorrect for chunks that are missing a category. I assume this 
> because the counts for the categories with lower indices (0, 1) are higher in 
> the bugged version than in the working version, and the counts for the higher 
> indices (2, 3, 4) are lower.
>  
> The difference is easy to see when running a value count. For example:
> !two.png!
> A workaround for now is to read the column directly as a string array and 
> then encode it as a dictionary. This isn't ideal, however, due to speed and 
> memory concerns.
> !one.png!
>  
> Attached are my Parquet file (test.parquet) and a simple Python script that 
> shows the difference (category_counts.py). I did not create this Parquet 
> file; I am consuming it from a service, so please excuse the data / UUID-style 
> column names. Please run the script with pyarrow 4.0.1 and pyarrow 9.0.0 to 
> see the difference in output.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
