[
https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt Jadczak updated ARROW-10246:
---------------------------------
Component/s: Python
C++
> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when
> duplicate values are present
> ---------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10246
> URL: https://issues.apache.org/jira/browse/ARROW-10246
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Matt Jadczak
> Priority: Major
>
> Copying this from [the mailing
> list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data through
> Parquet using pyarrow, when the data contains dictionary arrays with
> duplicate values.
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
> pq.write_table(my_table, "foo.parquet")
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code:text}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
>     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError
> {code}
> That clearly isn't right: the round-tripped column differs from the original.
>
> It seems to me that the reason this happens is that when re-encoding an
> Arrow dictionary as a Parquet one, the function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder from the Arrow dictionary data.
> Internally this builds a map from value to index by repeatedly calling
> GetOrInsert on a memo table. When it is called with duplicate values, as in
> the example above, a duplicate does not cause a new dictionary index to be
> allocated; instead the existing index is returned (and simply ignored).
> However, the caller assumes that the resulting Parquet dictionary uses
> exactly the same indices as the Arrow one, and proceeds to copy the index
> data over directly. In the example above, this results in an invalid
> dictionary index being written (the fact that it somehow wraps around when
> read back, rather than crashing, is potentially a second bug).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)