[ 
https://issues.apache.org/jira/browse/ARROW-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612230#comment-17612230
 ] 

Alenka Frim commented on ARROW-17900:
-------------------------------------

[~lidavidm] I _think_ this could be a bug in the {{Concatenate}} method in C++ 
introduced with version 6.0.0:
{code:python}
>>> pa.__version__
'6.0.0'
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7faa4013fa60>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>


  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      2,
      null
    ]
-- child 1 type: int64
  [
    12,
    3,
    1
  ] 
{code}
{code:python}
>>> pa.__version__
'5.0.0'
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7fe8f827eb20>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>

  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      1,
      2,
      null
    ]
-- child 1 type: int64
  [
    11,
    1,
    3,
    1
  ]

{code}
({{da}} being the array created by combining chunks, where each chunk 
is a DictionaryArray)

[https://github.com/apache/arrow/blob/f0303652b4934a9f767dca88268016c69375687d/python/pyarrow/array.pxi#L2946]

Can't find an open Jira issue for it.
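
To make the suspected failure mode concrete, here is a minimal pure-Python sketch of the hypothesis from the issue description: indices from each chunk are concatenated without being remapped onto a shared dictionary. It is only a model under that assumption, not Arrow's actual {{Concatenate}} code, and the chunk contents and helper names ({{concat_naive}}, {{concat_unified}}) are made up for illustration.

```python
from collections import Counter

# Hypothetical chunks: each carries its own local dictionary plus indices into it.
chunks = [
    {"dictionary": ["MV", "OB", "LMS2"], "indices": [0, 0, 1, 2]},
    {"dictionary": ["MV", "LMS2"], "indices": [0, 1, 1]},  # "OB" is missing here
]


def concat_naive(chunks):
    """Assumed buggy behaviour: keep the first dictionary, append indices unchanged."""
    dictionary = chunks[0]["dictionary"]
    indices = [i for chunk in chunks for i in chunk["indices"]]
    return dictionary, indices


def concat_unified(chunks):
    """Build one shared dictionary and remap every chunk's indices into it."""
    dictionary, position, indices = [], {}, []
    for chunk in chunks:
        for i in chunk["indices"]:
            value = chunk["dictionary"][i]
            if value not in position:
                position[value] = len(dictionary)
                dictionary.append(value)
            indices.append(position[value])
    return dictionary, indices


def value_counts(dictionary, indices):
    """Count decoded values, similar in spirit to value_counts() on the array."""
    return Counter(dictionary[i] for i in indices)


naive = value_counts(*concat_naive(chunks))
unified = value_counts(*concat_unified(chunks))
print("naive:  ", dict(naive))
print("unified:", dict(unified))
```

In this toy data the naive path over-counts "OB" (index 1) and under-counts "LMS2" (index 2), which is the same direction of skew the reporter describes.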

 

> [Python] combine_chunks on DictionaryArray appears to be broken
> ---------------------------------------------------------------
>
>                 Key: ARROW-17900
>                 URL: https://issues.apache.org/jira/browse/ARROW-17900
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jared Weston
>            Priority: Minor
>         Attachments: category_counts.py, one.png, test.parquet, two.png
>
>
> Recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug 
> when combining the chunks of a dictionary column spread across multiple row 
> groups. The dictionary is a string array of categories.
> It is worth noting that not every category is present in every chunk. To 
> me, the issue appears to be that, once the chunks are combined, the category 
> indices are incorrect for any chunk that is missing a category. I assume 
> this because the counts for the categories with lower indices (0, 1) are 
> higher in the bugged version than in the working version, and the counts 
> for the higher indices (2, 3, 4) are lower.
>  
> The difference is easy to see when running a value count. For example:
> !two.png!
> A workaround for now is to read the column directly as a string array and 
> then encode it as a dictionary. This isn't ideal, however, because of speed 
> and memory concerns.
> !one.png!
>  
> Attached are my parquet file (test.parquet) and a simple Python script to see 
> the difference (category_counts.py). I did not create this parquet file; 
> I am consuming it from a service, so excuse the data / uuid style column 
> names. Please run the script with pyarrow 4.0.1 and pyarrow 9.0.0 to see the 
> difference in output. The images say pyarrow 6.0.0, but the issue is still 
> present in 9.0.0 too.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
