[jira] [Created] (ARROW-13486) [C++] [Python] Dictionary equality not correct?

Jira Wed, 28 Jul 2021 20:24:07 -0700

Jorge Leitão created ARROW-13486:
------------------------------------

             Summary: [C++] [Python] Dictionary equality not correct?
                 Key: ARROW-13486
                 URL: https://issues.apache.org/jira/browse/ARROW-13486
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 5.0.0
            Reporter: Jorge Leitão



When comparing arrays for equality, we use their logical semantics: only the 
values a reader can observe matter, and in-memory details (such as the values 
stored in null slots) are ignored. However, this does not currently seem to 
hold for dictionary arrays.

Specifically, the assertion at the end of the following snippet fails:

{code:python}
import pyarrow as pa

# Nulls in both places: index slot 2 is null, and dictionary value 0 is null.
indices = pa.array([0, 1, None])
dictionary = pa.array([None, "bar"])
dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array1.tolist())
# [None, 'bar', None]

# Same logical values, but a different dictionary and null placement.
indices = pa.array([None, 1, 2])
dictionary = pa.array(["aa", "bar", None])
dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array2.tolist())
# [None, 'bar', None]

# Fails, even though both arrays decode to the same values.
assert dict_array1 == dict_array2
{code}

I found this while round-tripping dictionary arrays to and from Parquet (in 
arrow2). The round trip changes the in-memory representation because

1. we have two validities to worry about (indices and values)
2. Parquet does not support definition levels in the dictionary page

To preserve both validities, we need to "AND" them and write the result to the 
definition levels of the data page.

In this situation, even though the in-memory representation changes, the 
semantic equality remains (we just make values have no nulls and move all the 
nulls to the indices).
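
As a rough illustration of that normalization, here is a minimal sketch that 
uses plain Python lists in place of Arrow arrays (None marks a null slot). The 
helper names decode and move_nulls_to_indices are hypothetical and are not part 
of pyarrow or arrow2, and dropping the null entries from the dictionary (rather 
than keeping placeholders) is an assumption made only for this example.

```python
from typing import List, Optional, Tuple

Indices = List[Optional[int]]
Values = List[Optional[str]]


def decode(indices: Indices, values: Values) -> Values:
    """Logical view: a slot is null if its index is null or points at a null value."""
    return [None if i is None else values[i] for i in indices]


def move_nulls_to_indices(indices: Indices, values: Values) -> Tuple[Indices, Values]:
    """AND both validities so that every null ends up in the indices.

    The returned dictionary contains no nulls; surviving indices are
    remapped to positions in that compacted dictionary.
    """
    remap = {}
    dense: Values = []
    for old_pos, value in enumerate(values):
        if value is not None:
            remap[old_pos] = len(dense)
            dense.append(value)
    new_indices: Indices = [
        None if i is None or values[i] is None else remap[i] for i in indices
    ]
    return new_indices, dense


# The two arrays from the report, modeled as (indices, dictionary) pairs.
a = ([0, 1, None], [None, "bar"])
b = ([None, 1, 2], ["aa", "bar", None])

norm_a = move_nulls_to_indices(*a)
norm_b = move_nulls_to_indices(*b)

# The physical layouts differ before and after normalization,
# but the decoded (logical) values are identical throughout.
assert decode(*a) == decode(*b) == decode(*norm_a) == decode(*norm_b)
```

In this model the normalized pairs still differ physically (their 
dictionaries have different lengths), which is why an equality check based on 
decoded values, rather than on indices and dictionary separately, is what the 
report is asking for.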





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
