[jira] [Updated] (ARROW-13486) [C++] [Python] Dictionary equality not correct?

Jira Tue, 24 Aug 2021 12:10:04 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jorge Leitão updated ARROW-13486:
---------------------------------
    Description: 
When equating arrays, we use their semantics; i.e. we only care about the 
values we see, and any in-memory details (such as the values stored in null 
slots) are ignored. However, this does not seem to happen for dictionary 
arrays at the moment.

Specifically, the following does not pass:

{code:python}
import pyarrow as pa

indices = pa.array([0, 1, None])
dictionary = pa.array([None, "bar"])
dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array1.tolist())
# [None, "bar", None]

indices = pa.array([None, 1, 2])
dictionary = pa.array(["aa", "bar", None])
dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array2.tolist())
# [None, "bar", None]

assert dict_array1 == dict_array2
{code}

h2. Additional context

I found this while performing round-trips of dictionary arrays to and from 
parquet. This happens because

1. we have two validities to worry about (that of the indices and that of the values)
2. parquet does not support definition levels in the dictionary page

To preserve both validities, we need to "AND" them and write the result to the 
definition levels of the data page.

In this situation, even though the in-memory representation changes, the 
semantic equality remains (we just make values have no nulls and move all the 
nulls to the indices).



  was:
When equating arrays, we use their semantics; i.e. we only care about the 
things we see, and any in-memory details (such as values in the null slot) are 
ignored. However, this does not seem to happen in dictionary arrays atm.

Specifically, the following does not pass:

{code:python}
import pyarrow as pa

indices = pa.array([0, 1, None])
dictionary = pa.array([None, "bar"])
dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array1.tolist())
# [None, "bar", None]

indices = pa.array([None, 1, 2])
dictionary = pa.array(["aa", "bar", None])
dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array2.tolist())
# [None, "bar", None]

assert dict_array1 == dict_array2
{code}

I found this while performing round-trips of dictionary arrays to and from 
parquet. This happens because

1. we have two validities to worry (indices and values)
2. parquet does not support def levels in the dict page

To preserve both validities, we need to "AND" them and write them on the def 
levels of the data page.

In this situation, even though the in-memory representation changes, the 
semantic equality remains (we just make values have no nulls and move all the 
nulls to the indices).




> [C++] [Python] Dictionary equality not correct?
> -----------------------------------------------
>
>                 Key: ARROW-13486
>                 URL: https://issues.apache.org/jira/browse/ARROW-13486
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>            Reporter: Jorge Leitão
>            Priority: Major
>
> When equating arrays, we use their semantics; i.e. we only care about the 
> values we see, and any in-memory details (such as the values stored in null 
> slots) are ignored. However, this does not seem to happen for dictionary 
> arrays at the moment.
> Specifically, the following does not pass:
> {code:python}
> import pyarrow as pa
> indices = pa.array([0, 1, None])
> dictionary = pa.array([None, "bar"])
> dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
> print(dict_array1.tolist())
> # [None, "bar", None]
> indices = pa.array([None, 1, 2])
> dictionary = pa.array(["aa", "bar", None])
> dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
> print(dict_array2.tolist())
> # [None, "bar", None]
> assert dict_array1 == dict_array2
> {code}
> h2. Additional context
> I found this while performing round-trips of dictionary arrays to and from 
> parquet. This happens because
> 1. we have two validities to worry about (that of the indices and that of the values)
> 2. parquet does not support definition levels in the dictionary page
> To preserve both validities, we need to "AND" them and write the result to the 
> definition levels of the data page.
> In this situation, even though the in-memory representation changes, the 
> semantic equality remains (we just make values have no nulls and move all the 
> nulls to the indices).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
