[
https://issues.apache.org/jira/browse/ARROW-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611468#comment-17611468
]
Alenka Frim commented on ARROW-17900:
-------------------------------------
Thank you for reporting this! I made a small reproducible example:
{code:python}
import pyarrow as pa
indices = pa.array([0, 1, 2, 0, 2, 0, None, 2])
dictionary = pa.array(["MV", "OB", "LMS2"])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
indices1 = pa.array([0, 0, 0, 0, 0, 0, 0, 0])
dictionary1 = pa.array(["MV","OB"])
dict_array1 = pa.DictionaryArray.from_arrays(indices1, dictionary1)
# ChunkedArray made from two separate
# DictionaryArray objects
ca = pa.chunked_array((
dict_array,
dict_array1
))
# Creating one DictionaryArray from a ChunkedArray
# where each chunk is a DictionaryArray
da = ca.combine_chunks()
{code}
Checking the value counts in pyarrow 4.0.1, where the chunked and the combined array agree:
{code:python}
>>> pa.__version__
'4.0.1'
>>> ca.value_counts()
<pyarrow.lib.StructArray object at 0x7fcc4083d280>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>
-- dictionary:
[
"MV",
"OB",
"LMS2"
]
-- indices:
[
0,
1,
2,
null
]
-- child 1 type: int64
[
11,
1,
3,
1
]
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7fcc4083d220>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>
-- dictionary:
[
"MV",
"OB",
"LMS2"
]
-- indices:
[
0,
1,
2,
null
]
-- child 1 type: int64
[
11,
1,
3,
1
]
{code}
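and in pyarrow 9.0.0, where the combined array no longer reports index 1 ("OB") and its count is added to index 0 ("MV"):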
{code:python}
>>> pa.__version__
'9.0.0'
>>> ca.value_counts()
<pyarrow.lib.StructArray object at 0x7fa4989877c0>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>
-- dictionary:
[
"MV",
"OB",
"LMS2"
]
-- indices:
[
0,
1,
2,
null
]
-- child 1 type: int64
[
11,
1,
3,
1
]
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7fa498987be0>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>
-- dictionary:
[
"MV",
"OB",
"LMS2"
]
-- indices:
[
0,
2,
null
]
-- child 1 type: int64
[
12,
3,
1
]
{code}
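For reference, the expected counts can be derived without pyarrow by decoding each chunk's indices against that chunk's own dictionary. The snippet below is only a plain-Python sketch of that reasoning; the helper name {{expected_value_counts}} is made up for illustration and is not a pyarrow API. It uses the same indices and dictionaries as the example above.

```python
from collections import Counter


def expected_value_counts(chunks):
    """Count decoded values across (indices, dictionary) pairs.

    Hypothetical helper for illustration only, not part of pyarrow.
    """
    counts = Counter()
    for indices, dictionary in chunks:
        for idx in indices:
            # Each index is resolved against its own chunk's dictionary;
            # a null index (None) is counted as a null value.
            counts[None if idx is None else dictionary[idx]] += 1
    return counts


# Same data as the two DictionaryArray chunks in the example above.
chunks = [
    ([0, 1, 2, 0, 2, 0, None, 2], ["MV", "OB", "LMS2"]),
    ([0, 0, 0, 0, 0, 0, 0, 0], ["MV", "OB"]),
]

counts = expected_value_counts(chunks)
print(counts)
```

This gives 11 for "MV", 1 for "OB", 3 for "LMS2" and 1 null, which matches both outputs in 4.0.1 and {{ca.value_counts()}} in 9.0.0, but not {{da.value_counts()}} in 9.0.0.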
> [Python] combine_chunks on DictionaryArray appears to be broken
> ---------------------------------------------------------------
>
> Key: ARROW-17900
> URL: https://issues.apache.org/jira/browse/ARROW-17900
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Jared Weston
> Priority: Minor
> Attachments: category_counts.py, one.png, test.parquet, two.png
>
>
> I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug
> when combining the chunks of a dictionary with multiple row groups. The
> dictionary is a StringArray of categories.
> It is worth noting that not every category is present in every chunk. To
> me, the issue appears to be that the per-chunk category indices are
> incorrect after the chunks are combined when a category is missing from a
> chunk. I assume this because the counts for the categories with lower indices
> (0, 1) are higher in the bugged version than in the working version, and the
> counts for the higher indices (2, 3, 4) are lower.
>
> The difference is easy to see when running a value count. For example:
> !two.png!
> A workaround for now is to read the column directly as a string array and then
> encode it as a dictionary. This isn't ideal, however, due to speed and memory
> concerns.
> !one.png!
>
> Attached are my Parquet file (test.parquet) and a simple Python script to see
> the difference (category_counts.py). I did not create this Parquet file;
> I am consuming it from a service, so excuse the data / uuid style column
> names. Please run the script with pyarrow 4.0.1 and pyarrow 9.0.0 to see the
> difference in output. The images say pyarrow 6.0.0, but the issue is still
> present in 9.0.0 too.
>