Jorge Leitão created ARROW-13486:
------------------------------------
Summary: [C++] [Python] Dictionary equality not correct?
Key: ARROW-13486
URL: https://issues.apache.org/jira/browse/ARROW-13486
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 5.0.0
Reporter: Jorge Leitão
When comparing arrays for equality, we use their semantics; i.e. we only care
about the logical values we see, and any in-memory details (such as the values
stored in null slots) are ignored. However, this does not seem to hold for
dictionary arrays at the moment. Specifically, the following assertion fails:
{code:python}
import pyarrow as pa
indices = pa.array([0, 1, None])
dictionary = pa.array([None, "bar"])
dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array1.tolist())
# [None, "bar", None]
indices = pa.array([None, 1, 2])
dictionary = pa.array(["aa", "bar", None])
dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array2.tolist())
# [None, "bar", None]
assert dict_array1 == dict_array2
{code}
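For reference, the comparison I would expect can be sketched in plain Python, with lists standing in for the Arrow buffers and None standing in for a null slot. The helpers `decode` and `semantically_equal` are hypothetical names used only for illustration; they are not part of pyarrow.

```python
from typing import List, Optional, Tuple

DictArray = Tuple[List[Optional[int]], List[Optional[str]]]


def decode(indices: List[Optional[int]], values: List[Optional[str]]) -> List[Optional[str]]:
    """Resolve each index against the dictionary; a null index yields a null."""
    return [None if i is None else values[i] for i in indices]


def semantically_equal(a: DictArray, b: DictArray) -> bool:
    """Compare two (indices, values) pairs by the logical values they decode to."""
    return decode(*a) == decode(*b)


# Same data as the reproduction above, modelled as (indices, dictionary) pairs.
array1: DictArray = ([0, 1, None], [None, "bar"])
array2: DictArray = ([None, 1, 2], ["aa", "bar", None])

print(decode(*array1))  # [None, 'bar', None]
print(decode(*array2))  # [None, 'bar', None]
print(semantically_equal(array1, array2))  # True
```

Under this model, a null is a null regardless of whether it comes from the indices or from the dictionary, which is the behaviour the assertion above relies on.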
I found this while performing round-trips of dictionary arrays to and from
parquet (in arrow2). This happens because
1. we have two validities to worry about (indices and values)
2. parquet does not support def levels in the dict page
To preserve both validities, we need to "AND" them and write the result to the
def levels of the data page.
In this situation, even though the in-memory representation changes, the
semantic equality remains (we just make values have no nulls and move all the
nulls to the indices).
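The "AND" step and the resulting rewrite can be sketched as follows, again in plain Python with None marking a null. This is only an illustration under that assumption, not the arrow2 or parquet writer code; `merge_validities` and `move_nulls_to_indices` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def merge_validities(indices: List[Optional[int]], values: List[Optional[str]]) -> List[bool]:
    """A slot is valid only if its index is valid AND the value it points to is valid."""
    return [i is not None and values[i] is not None for i in indices]


def move_nulls_to_indices(
    indices: List[Optional[int]], values: List[Optional[str]]
) -> Tuple[List[Optional[int]], List[str]]:
    """Rewrite the array so the dictionary has no nulls and all nulls live in the indices."""
    # Map each non-null dictionary position to its position in the compacted dictionary.
    remap: Dict[int, int] = {}
    new_values: List[str] = []
    for old_position, value in enumerate(values):
        if value is not None:
            remap[old_position] = len(new_values)
            new_values.append(value)

    validity = merge_validities(indices, values)
    # Invalid slots become null indices; valid ones point into the compacted dictionary.
    new_indices = [remap[i] if ok else None for i, ok in zip(indices, validity)]
    return new_indices, new_values


print(merge_validities([0, 1, None], [None, "bar"]))  # [False, True, False]
print(move_nulls_to_indices([0, 1, None], [None, "bar"]))  # ([None, 0, None], ['bar'])
print(move_nulls_to_indices([None, 1, 2], ["aa", "bar", None]))  # ([None, 1, None], ['aa', 'bar'])
```

The two rewritten arrays differ in their indices and dictionaries, yet both still decode to [None, "bar", None], which is why the equality check should not depend on where the nulls are stored.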