[ 
https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Jadczak updated ARROW-10246:
---------------------------------
    Component/s: Python
                 C++

> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when 
> duplicate values are present
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10246
>                 URL: https://issues.apache.org/jira/browse/ARROW-10246
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Matt Jadczak
>            Priority: Major
>
> Copying this from [the mailing 
> list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data via 
> parquet using pyarrow, when the data contains dictionary arrays with 
> duplicate values.
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
> 
> pq.write_table(my_table, "foo.parquet")
> 
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
> 
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
> 
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
>     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError
> {code}
> This is clearly wrong: the round-tripped column no longer matches the original.
>  
> The apparent cause: when re-encoding an Arrow dictionary as a Parquet one, the 
> function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder from the Arrow dictionary data. 
> Internally this builds a map from value to index by repeatedly calling 
> GetOrInsert on a memo table. When a duplicate value is encountered, as in Al's 
> example, no new dictionary index is allocated; the existing index is returned 
> (and then simply ignored). The caller, however, assumes that the resulting 
> Parquet dictionary uses exactly the same indices as the Arrow one, and copies 
> the index data verbatim. In Al's example this results in an out-of-range 
> dictionary index being written (that it somehow wraps around when read back, 
> rather than erroring, is potentially a second bug).
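The mechanism described above can be sketched in pure Python. This is an illustration only, not the actual C++ in encoding.cc; the modulo wrap-around on read is an assumption inferred from the observed output:

```python
# Simulate the DictEncoder's memo table deduplicating the Arrow dictionary
# values while the caller copies the original Arrow indices verbatim.
arrow_values = ['a', 'd', 'c', 'd', 'e']   # Arrow dictionary ('d' duplicated)
arrow_indices = [0, 1, 2, 3, 4]

# GetOrInsert-style memo table: a duplicate returns its existing slot
# instead of allocating a new one.
memo = {}
remapped = []
for v in arrow_values:
    remapped.append(memo.setdefault(v, len(memo)))
parquet_values = list(memo)                # ['a', 'd', 'c', 'e'] -- only 4 slots

# Bug: the Arrow indices are copied unchanged, so index 4 is out of range for
# the 4-entry Parquet dictionary; on read it appears to wrap modulo its size.
buggy = [parquet_values[i % len(parquet_values)] for i in arrow_indices]
print(buggy)   # ['a', 'd', 'c', 'e', 'a'] -- the corrupted output above

# Fix sketch: translate each Arrow index through the memo table's remapping
# before writing, so duplicates collapse onto the same Parquet slot.
fixed = [parquet_values[remapped[i]] for i in arrow_indices]
print(fixed)   # ['a', 'd', 'c', 'd', 'e'] -- round-trips correctly
```

With the remapping in place, deduplication is harmless: the indices are rewritten to agree with the deduplicated dictionary instead of being copied blindly.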



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
