Hi,

I've found the following odd behaviour when round-tripping data through parquet
using pyarrow, where the data contains dictionary arrays with duplicate values.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Column 'bar' is dictionary-encoded and its dictionary contains a
# duplicate value ('d' appears twice).
my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                )
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)

pq.write_table(my_table, "foo.parquet")

read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)

print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())

assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
```

Both tables pass full validation, yet the last three lines print:
```
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
  File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
    assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
```

That clearly doesn't look right!
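
In case it helps with digging in, the underlying dictionaries and indices can
be compared directly with something like this (building on the snippet above;
it assumes 'bar' reads back as a dictionary column with a single chunk):

```python
# Compare the dictionary and indices of each side directly.
orig = my_table.column('bar').chunk(0)
back = read_table.column('bar').chunk(0)
print(orig.dictionary, orig.indices)
print(back.dictionary, back.indices)
```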

My question is whether I'm fundamentally breaking some assumption that
dictionary values must be unique, or whether there's a bug in the
parquet-arrow conversion.
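
If the duplicate values do turn out to be the trigger, I can presumably work
around it for now by re-encoding the column before writing so the dictionary
values are unique. A rough sketch, building on my_table and pq from the
snippet above (variable names are just mine, and I haven't verified this
actually sidesteps the problem):

```python
# Decode 'bar' to a plain string column, then re-encode it so the new
# dictionary contains only unique values.
bar = my_table.column('bar')
dense = bar.cast(bar.type.value_type)      # dictionary<string> -> string
unique_bar = dense.dictionary_encode()     # fresh dictionary, unique values
fixed = my_table.set_column(1, 'bar', unique_bar)
pq.write_table(fixed, "foo.parquet")
```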

Thanks,

Al
