Matt Jadczak created ARROW-10246:
------------------------------------
Summary: [Python] Incorrect conversion of Arrow dictionary to
Parquet dictionary when duplicate values are present
Key: ARROW-10246
URL: https://issues.apache.org/jira/browse/ARROW-10246
Project: Apache Arrow
Issue Type: Bug
Reporter: Matt Jadczak
Copying this from [the mailing
list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
We can observe the following odd behaviour when round-tripping data via parquet
using pyarrow, when the data contains dictionary arrays with duplicate values.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                )
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)
pq.write_table(my_table, "foo.parquet")
read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)
print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())
assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
{code}
Both tables pass full validation, yet the last three lines print:
{code}
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
  File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
    assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
{code}
Which clearly doesn't look right!
It seems to me that this happens because, when re-encoding an Arrow
dictionary as a Parquet one, the function at
[https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
is called to create a Parquet DictEncoder from the Arrow dictionary data.
Internally this uses a map from value to index, built by repeatedly calling
GetOrInsert on a memo table. When called with duplicate values, as in Al's
example, a duplicate does not cause a new dictionary index to be allocated;
the existing index is returned instead (and that return value is ignored).
However, the caller assumes that the resulting Parquet dictionary uses
exactly the same indices as the Arrow one, and proceeds to copy the index
data directly. In Al's example this results in an invalid dictionary index
being written (the fact that it somehow wraps around when read back, rather
than crashing, is potentially a second bug).
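To illustrate the suspected mechanism, here is a minimal pure-Python sketch
(the names {{get_or_insert}}, {{memo}}, etc. are illustrative, not the actual
C++ identifiers, and the wrap-around on read is an assumption based on the
observed output):

{code:python}
def get_or_insert(memo, value):
    """Return the index for value, inserting only if unseen
    (mimics the memo table used to build the Parquet dictionary)."""
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]

arrow_dictionary = ['a', 'd', 'c', 'd', 'e']  # duplicate 'd' at indices 1 and 3
arrow_indices = [0, 1, 2, 3, 4]

# Re-encoding the dictionary: the duplicate 'd' is collapsed instead of
# getting a fresh slot, so the Parquet dictionary ends up shorter.
memo = {}
for v in arrow_dictionary:
    get_or_insert(memo, v)
parquet_dictionary = list(memo)  # ['a', 'd', 'c', 'e'] -- only 4 entries

# The buggy path copies the Arrow indices unchanged; index 4 is now out of
# range, and index 3 points at 'e' instead of 'd'. Wrapping the indices
# modulo the dictionary length reproduces the corrupt round-trip output.
decoded = [parquet_dictionary[i % len(parquet_dictionary)]
           for i in arrow_indices]
print(parquet_dictionary)  # ['a', 'd', 'c', 'e']
print(decoded)             # ['a', 'd', 'c', 'e', 'a']
{code}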
--
This message was sent by Atlassian Jira
(v8.3.4#803005)