[
https://issues.apache.org/jira/browse/ARROW-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-9801:
-----------------------------------------
Fix Version/s: 2.0.0
> DictionaryArray with non-unique values are silently corrupted when written to
> a Parquet file
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-9801
> URL: https://issues.apache.org/jira/browse/ARROW-9801
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Environment: pyarrow 1.0.0 installed from conda-forge.
> Reporter: Jim Pivarski
> Priority: Major
> Fix For: 2.0.0
>
>
> Suppose that you have a DictionaryArray with repeated values in the
> dictionary:
> {{>>> import pyarrow as pa}}
> {{>>> pa_array = pa.DictionaryArray.from_arrays(}}
> {{... pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),}}
> {{... pa.array(["one", "two", "three", "one", "two", "three"])}}
> {{... )}}
> {{>>> pa_array}}
> {{<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>}}
> {{-- dictionary:}}
> {{  [}}
> {{    "one",}}
> {{    "two",}}
> {{    "three",}}
> {{    "one",}}
> {{    "two",}}
> {{    "three"}}
> {{  ]}}
> {{-- indices:}}
> {{  [}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5,}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5}}
> {{  ]}}
> According to [the
> documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
> {quote}Dictionary encoding is a data representation technique to represent
> values by integers referencing a *dictionary* usually consisting of unique
> values.
> {quote}
> so a DictionaryArray like the one above is arguably invalid, but if so, then
> I'd expect some error messages, rather than corrupt data, when I try to write
> it to a Parquet file.
> {{>>> pa_table = pa.Table.from_batches(}}
> {{... [pa.RecordBatch.from_arrays([pa_array], ["column"])]}}
> {{... )}}
> {{>>> pa_table}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int64, ordered=0>}}
> {{>>> import pyarrow.parquet}}
> {{>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")}}
> No errors so far. So we try to read it back and view it:
> {{>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")}}
> {{>>> pa_loaded}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int32, ordered=0>}}
> {{>>> pa_loaded.to_pydict()}}
> {{Traceback (most recent call last):}}
> {{ File "<stdin>", line 1, in <module>}}
> {{ File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict}}
> {{ File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist}}
> {{ File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist}}
> {{ File "pyarrow/scalar.pxi", line 712, in
> pyarrow.lib.DictionaryScalar.as_py}}
> {{ File "pyarrow/scalar.pxi", line 701, in
> pyarrow.lib.DictionaryScalar.value.__get__}}
> {{ File "pyarrow/error.pxi", line 122, in
> pyarrow.lib.pyarrow_internal_check_status}}
> {{ File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status}}
> {{pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only
> 3 long}}
> Looking more closely at this, we see that the dictionary has been minimized
> to include only unique values, but the indices haven't been correctly
> translated:
> {{>>> pa_loaded["column"]}}
> {{<pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>}}
> {{[}}
> {{  -- dictionary:}}
> {{    [}}
> {{      "one",}}
> {{      "two",}}
> {{      "three"}}
> {{    ]}}
> {{  -- indices:}}
> {{    [}}
> {{      0,}}
> {{      1,}}
> {{      2,}}
> {{      3,}}
> {{      0,}}
> {{      1,}}
> {{      1,}}
> {{      1,}}
> {{      2,}}
> {{      3,}}
> {{      0,}}
> {{      1}}
> {{    ]}}
> {{]}}
> It looks like an attempt was made to minimize the dictionary, but the indices
> ought to be [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2].
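> For concreteness, here is a small sketch in plain Python (my own illustration,
> not pyarrow's internal code) of the translation I would have expected:
> deduplicate the dictionary, then remap each index through the position its old
> value takes in the deduplicated dictionary:
> {{>>> old_dictionary = pa_array.dictionary.to_pylist()}}
> {{>>> old_indices = pa_array.indices.to_pylist()}}
> {{>>> new_dictionary = list(dict.fromkeys(old_dictionary))  # order-preserving dedup}}
> {{>>> remap = [new_dictionary.index(v) for v in old_dictionary]  # old position -> new position}}
> {{>>> [remap[i] for i in old_indices]}}
> {{[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]}}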
> I don't know what your preferred course of action is (adding an error message
> or fixing the attempted conversion), but the current behavior silently corrupts
> data. On my side, I'm adding code to prevent the creation of DictionaryArrays
> with non-unique dictionary values.
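> For what it's worth, the guard I have in mind is roughly the following (the
> helper name is mine, not a pyarrow API); it simply refuses to build a
> DictionaryArray whose dictionary repeats values:
> {{>>> def safe_dictionary_array(indices, dictionary):}}
> {{...     # reject repeated dictionary values before they can reach a writer}}
> {{...     if len(dictionary.unique()) != len(dictionary):}}
> {{...         raise ValueError("dictionary values must be unique")}}
> {{...     return pa.DictionaryArray.from_arrays(indices, dictionary)}}
> {{... }}
> {{>>> safe_dictionary_array(}}
> {{...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),}}
> {{...     pa.array(["one", "two", "three", "one", "two", "three"])}}
> {{... )}}
> {{Traceback (most recent call last):}}
> {{ ...}}
> {{ValueError: dictionary values must be unique}}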
--
This message was sent by Atlassian Jira
(v8.3.4#803005)