[ 
https://issues.apache.org/jira/browse/ARROW-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-9801.
----------------------------------------
    Resolution: Duplicate

> DictionaryArray with non-unique values are silently corrupted when written to 
> a Parquet file
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9801
>                 URL: https://issues.apache.org/jira/browse/ARROW-9801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: pyarrow 1.0.0 installed from conda-forge.
>            Reporter: Jim Pivarski
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Suppose that you have a DictionaryArray with repeated values in the 
> dictionary:
> {{>>> import pyarrow as pa}}
> {{>>> pa_array = pa.DictionaryArray.from_arrays(}}
> {{...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),}}
> {{...     pa.array(["one", "two", "three", "one", "two", "three"])}}
> {{... )}}
> {{>>> pa_array}}
> {{<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>}}
> {{-- dictionary:}}
> {{ [}}
> {{    "one",}}
> {{    "two",}}
> {{    "three",}}
> {{    "one",}}
> {{    "two",}}
> {{    "three"}}
> {{ ]}}
> {{-- indices:}}
> {{ [}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5,}}
> {{    0,}}
> {{    1,}}
> {{    2,}}
> {{    3,}}
> {{    4,}}
> {{    5}}
> {{ ]}}
> According to [the documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
> {quote}Dictionary encoding is a data representation technique to represent 
> values by integers referencing a *dictionary* usually consisting of unique 
> values.
> {quote}
> so a DictionaryArray like the one above is arguably invalid, but if so, then 
> I'd expect some error messages, rather than corrupt data, when I try to write 
> it to a Parquet file.
> {{>>> pa_table = pa.Table.from_batches(}}
> {{...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]}}
> {{... )}}
> {{>>> pa_table}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int64, ordered=0>}}
> {{>>> import pyarrow.parquet}}
> {{>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")}}
> No errors so far. So we try to read it back and view it:
> {{>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")}}
> {{>>> pa_loaded}}
> {{pyarrow.Table}}
> {{column: dictionary<values=string, indices=int32, ordered=0>}}
> {{>>> pa_loaded.to_pydict()}}
> {{Traceback (most recent call last):}}
> {{ File "<stdin>", line 1, in <module>}}
> {{ File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict}}
> {{ File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist}}
> {{ File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist}}
> {{ File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py}}
> {{ File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__}}
> {{ File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status}}
> {{ File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status}}
> {{pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long}}
> Looking more closely at this, we see that the dictionary has been minimized 
> to include only unique values, but the indices haven't been correctly 
> translated:
> {{>>> pa_loaded["column"]}}
> {{<pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>}}
> {{[}}
> {{    -- dictionary:}}
> {{    [}}
> {{        "one",}}
> {{        "two",}}
> {{        "three"}}
> {{    ]}}
> {{    -- indices:}}
> {{    [}}
> {{        0,}}
> {{        1,}}
> {{        2,}}
> {{        3,}}
> {{        0,}}
> {{        1,}}
> {{        1,}}
> {{        1,}}
> {{        2,}}
> {{        3,}}
> {{        0,}}
> {{        1}}
> {{    ]}}
> {{]}}
> It looks like an attempt was made to minimize it, but the indices ought to be
> [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
> I don't know what your preferred course of action is—adding an error message 
> or fixing the attempted conversion—but this is wrong. On my side, I'm adding 
> code to prevent the creation of non-unique values in DictionaryArrays.
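
The index translation the report expects can be sketched in plain Python. This is only an illustration of the remapping idea, not Arrow's or Parquet's actual code path; the helper name dedupe_dictionary is hypothetical, and plain lists stand in for the Arrow arrays.

```python
def dedupe_dictionary(indices, dictionary):
    """Return (new_indices, unique_values) that decode to the same sequence.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    first_position = {}  # value -> position in the deduplicated dictionary
    remap = []           # old dictionary position -> new dictionary position
    unique_values = []
    for value in dictionary:
        if value not in first_position:
            first_position[value] = len(unique_values)
            unique_values.append(value)
        remap.append(first_position[value])
    # Translate every index through the remap table; nulls stay null.
    new_indices = [None if i is None else remap[i] for i in indices]
    return new_indices, unique_values


# The data from the report above.
indices = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
dictionary = ["one", "two", "three", "one", "two", "three"]

new_indices, unique_values = dedupe_dictionary(indices, dictionary)
print(unique_values)  # ['one', 'two', 'three']
print(new_indices)    # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Comparing len(unique_values) with len(dictionary) is also a cheap way to detect a non-unique dictionary before writing, if raising an error is preferred over remapping.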



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
