I haven't looked closely, but it looks like a bug. Can someone open a
JIRA issue and copy the reproducible example?

On Thu, Oct 8, 2020 at 10:57 AM Jadczak, Matt
<matt.jadc...@gsacapital.com> wrote:
>
> I am unsure if this behaviour is intended (and duplicate values should be 
> forbidden), but it seems to me that the reason this is happening is that when 
> re-encoding an Arrow dictionary as a Parquet one, the function at 
> https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773
>  is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
> This internally uses a map from value to index, and this map is constructed 
> by continually calling GetOrInsert on a memo table. When called with 
> duplicate values as in Al's example, the duplicates do not cause a new 
> dictionary index to be allocated, but instead return the existing one (which 
> is just ignored). However, the caller assumes that the resulting Parquet 
> dictionary uses the exact same indices as the Arrow one, and proceeds to just 
> copy the index data directly. In Al's example, this results in an invalid 
> dictionary index being written: index 4 is copied as-is, but the deduplicated 
> Parquet dictionary has only four entries. (That the index somehow wraps around 
> when read back, rather than crashing, is potentially a second bug.)
>
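To make the mechanism described above concrete, here is a minimal pure-Python sketch. It is not the Arrow or Parquet C++ code: `get_or_insert` and `build_parquet_style_dictionary` are hypothetical stand-ins for the memo table and the encoder setup, and plain lists stand in for the index and dictionary arrays.

```python
# Hypothetical stand-in for the memo table: maps value -> dictionary index.
def get_or_insert(memo, value):
    """Return the index of `value`, allocating a new one only if unseen."""
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]


def build_parquet_style_dictionary(arrow_dictionary):
    """Build a deduplicated dictionary, ignoring the returned indices."""
    memo = {}
    for value in arrow_dictionary:
        get_or_insert(memo, value)  # return value dropped, as described above
    return list(memo)  # dicts preserve insertion order


arrow_dictionary = ['a', 'd', 'c', 'd', 'e']
arrow_indices = [0, 1, 2, 3, 4]

parquet_dictionary = build_parquet_style_dictionary(arrow_dictionary)

# Indices copied unchanged that no longer fit the shorter dictionary.
invalid_indices = [i for i in arrow_indices if i >= len(parquet_dictionary)]

print(parquet_dictionary)
print(invalid_indices)
```

With the data from Al's example, the deduplicated dictionary has four entries, so index 3 now points at 'e' instead of 'd' and index 4 is out of range. That matches the first four values of the output he reports; how the reader ends up with 'a' for the last value is not modelled here.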
> On 2020/10/08 15:04:22, Al Taylor <a...@googlemail.com.INVALID> wrote:
> > Hi,
> >
> > I've found the following odd behaviour when round-tripping data via parquet 
> > using pyarrow, when the data contains dictionary arrays with duplicate 
> > values.
> >
> > ```python
> >     import pyarrow as pa
> >     import pyarrow.parquet as pq
> >
> >     my_table = pa.Table.from_batches(
> >         [
> >             pa.RecordBatch.from_arrays(
> >                 [
> >                     pa.array([0, 1, 2, 3, 4]),
> >                     pa.DictionaryArray.from_arrays(
> >                         pa.array([0, 1, 2, 3, 4]),
> >                         pa.array(['a', 'd', 'c', 'd', 'e'])
> >                     )
> >                 ],
> >                 names=['foo', 'bar']
> >             )
> >         ]
> >     )
> >     my_table.validate(full=True)
> >
> >     pq.write_table(my_table, "foo.parquet")
> >
> >     read_table = pq.ParquetFile("foo.parquet").read()
> >     read_table.validate(full=True)
> >
> >     print(my_table.column(1).to_pylist())
> >     print(read_table.column(1).to_pylist())
> >
> >     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > ```
> >
> > Both tables pass full validation, yet the last three lines print:
> > ```
> > ['a', 'd', 'c', 'd', 'e']
> > ['a', 'd', 'c', 'e', 'a']
> > Traceback (most recent call last):
> >   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
> >     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > AssertionError
> > ```
> >
> > Which clearly doesn't look right!
> >
> > My question is whether I'm fundamentally breaking some assumption that 
> > dictionary values are unique or if there's a bug in the parquet-arrow 
> > conversion?
>
> >
> > Thanks,
> >
> > Al
> >
>
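Whichever way the question about uniqueness is settled, one possible workaround is to make the dictionary values unique and remap the indices before writing. The sketch below shows only the remapping logic on plain Python lists; `deduplicate_dictionary` is a hypothetical helper, not a pyarrow function.

```python
def deduplicate_dictionary(indices, dictionary):
    """Return (new_indices, unique_dictionary) that decode to the same values."""
    position = {}  # value -> index in the unique dictionary
    remap = []     # old dictionary index -> new dictionary index
    for value in dictionary:
        if value not in position:
            position[value] = len(position)
        remap.append(position[value])
    # Nulls (None) in the index list are passed through untouched.
    new_indices = [None if i is None else remap[i] for i in indices]
    return new_indices, list(position)


indices = [0, 1, 2, 3, 4]
dictionary = ['a', 'd', 'c', 'd', 'e']

new_indices, unique_dictionary = deduplicate_dictionary(indices, dictionary)

# Decoding before and after should give the same values.
decoded_before = [dictionary[i] for i in indices]
decoded_after = [unique_dictionary[i] for i in new_indices]
print(new_indices, unique_dictionary)
print(decoded_before == decoded_after)
```

The remapped indices and unique values could then be passed to `pa.DictionaryArray.from_arrays`, as in the example above, in place of the original arrays. I have not verified that this avoids the problem on the affected pyarrow version.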
