This has been reported as https://issues.apache.org/jira/browse/ARROW-10237 and has since been fixed.
Joris

On Thu, 8 Oct 2020 at 18:20, Wes McKinney <wesmck...@gmail.com> wrote:
> I haven't looked closely but it looks like a bug, can someone open a
> JIRA issue and copy the reproducible example?
>
> On Thu, Oct 8, 2020 at 10:57 AM Jadczak, Matt
> <matt.jadc...@gsacapital.com> wrote:
> >
> > I am unsure if this behaviour is intended (and duplicate values should
> > be forbidden), but it seems to me that the reason this is happening is
> > that when re-encoding an Arrow dictionary as a Parquet one, the function at
> > https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773
> > is called to create a Parquet DictEncoder out of the Arrow dictionary
> > data. This internally uses a map from value to index, and this map is
> > constructed by continually calling GetOrInsert on a memo table. When
> > called with duplicate values as in Al's example, the duplicates do not
> > cause a new dictionary index to be allocated, but instead return the
> > existing one (which is just ignored). However, the caller assumes that
> > the resulting Parquet dictionary uses the exact same indices as the
> > Arrow one, and proceeds to just copy the index data directly. In Al's
> > example, this results in an invalid dictionary index being written
> > (that it is somehow wrapped around when reading again, rather than
> > crashing, is potentially a second bug).
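To make the mechanism described above concrete, here is a minimal pure-Python sketch. It is not the Arrow or Parquet C++ code: `get_or_insert`, `build_parquet_dictionary` and `remap_indices` are hypothetical stand-ins for the memo-table behaviour in the explanation, and the modulo used when decoding is only an assumption chosen to mimic the wrap-around seen in Al's output. The remapping at the end is one possible remedy, not necessarily what the fix for ARROW-10237 does.

```python
def get_or_insert(memo, value):
    """Return the index for value, allocating a new one only if it is unseen."""
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]


def build_parquet_dictionary(arrow_dictionary):
    """Build the de-duplicated dictionary, ignoring the returned indices."""
    memo = {}
    for value in arrow_dictionary:
        get_or_insert(memo, value)  # return value dropped, as in the description
    # dicts keep insertion order, so the keys are already in index order
    return list(memo)


def remap_indices(arrow_dictionary, arrow_indices):
    """Hypothetical remedy: keep the returned indices and remap through them."""
    memo = {}
    remap = [get_or_insert(memo, value) for value in arrow_dictionary]
    return list(memo), [remap[i] for i in arrow_indices]


# Data from Al's example: the value 'd' appears twice in the dictionary.
arrow_dictionary = ['a', 'd', 'c', 'd', 'e']
arrow_indices = [0, 1, 2, 3, 4]

parquet_dictionary = build_parquet_dictionary(arrow_dictionary)
copied = list(arrow_indices)  # indices copied verbatim, as the caller does

# Index 4 no longer exists in the 4-entry dictionary.
out_of_range = [i for i in copied if i >= len(parquet_dictionary)]

# Assumption: emulate the observed wrap-around with a modulo on read.
decoded_buggy = [parquet_dictionary[i % len(parquet_dictionary)] for i in copied]

fixed_dictionary, fixed_indices = remap_indices(arrow_dictionary, arrow_indices)
decoded_fixed = [fixed_dictionary[i] for i in fixed_indices]

print(parquet_dictionary)  # ['a', 'd', 'c', 'e']
print(out_of_range)        # [4]
print(decoded_buggy)       # ['a', 'd', 'c', 'e', 'a']
print(decoded_fixed)       # ['a', 'd', 'c', 'd', 'e']
```

With the indices copied verbatim, the sketch reproduces the `['a', 'd', 'c', 'e', 'a']` result from the report. Remapping through the indices returned by the memo table restores the original values.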
> >
> > On 2020/10/08 15:04:22, Al Taylor <a...@googlemail.com.INVALID> wrote:
> > > Hi,
> > >
> > > I've found the following odd behaviour when round-tripping data via
> > > parquet using pyarrow, when the data contains dictionary arrays with
> > > duplicate values.
> > >
> > > ```python
> > > import pyarrow as pa
> > > import pyarrow.parquet as pq
> > >
> > > my_table = pa.Table.from_batches(
> > >     [
> > >         pa.RecordBatch.from_arrays(
> > >             [
> > >                 pa.array([0, 1, 2, 3, 4]),
> > >                 pa.DictionaryArray.from_arrays(
> > >                     pa.array([0, 1, 2, 3, 4]),
> > >                     pa.array(['a', 'd', 'c', 'd', 'e'])
> > >                 )
> > >             ],
> > >             names=['foo', 'bar']
> > >         )
> > >     ]
> > > )
> > > my_table.validate(full=True)
> > >
> > > pq.write_table(my_table, "foo.parquet")
> > >
> > > read_table = pq.ParquetFile("foo.parquet").read()
> > > read_table.validate(full=True)
> > >
> > > print(my_table.column(1).to_pylist())
> > > print(read_table.column(1).to_pylist())
> > >
> > > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > > ```
> > >
> > > Both tables pass full validation, yet the last three lines print:
> > > ```
> > > ['a', 'd', 'c', 'd', 'e']
> > > ['a', 'd', 'c', 'e', 'a']
> > > Traceback (most recent call last):
> > >   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
> > >     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > > AssertionError
> > > ```
> > >
> > > Which clearly doesn't look right!
> > >
> > > My question is whether I'm fundamentally breaking some assumption that
> > > dictionary values are unique or if there's a bug in the parquet-arrow
> > > conversion?
> > >
> > > Thanks,
> > >
> > > Al