[
https://issues.apache.org/jira/browse/ARROW-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jorge Leitão updated ARROW-13486:
---------------------------------
Description:
When comparing arrays for equality, we use their semantics; i.e. we only care
about the values a reader sees, and any in-memory details (such as the values
stored in null slots) are ignored. However, this does not seem to be the case
for dictionary arrays at the moment. Specifically, the following assertion fails:
{code:python}
import pyarrow as pa
indices = pa.array([0, 1, None])
dictionary = pa.array([None, "bar"])
dict_array1 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array1.tolist())
# [None, "bar", None]
indices = pa.array([None, 1, 2])
dictionary = pa.array(["aa", "bar", None])
dict_array2 = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array2.tolist())
# [None, "bar", None]
assert dict_array1 == dict_array2
{code}
h2. Additional context
I found this while performing round-trips of dictionary arrays to and from
Parquet. This happens because
1. we have two validities to worry about (indices and values)
2. Parquet does not support def levels in the dict page
To preserve both validities, we need to "AND" them and write the result to the
def levels of the data page.
In this situation, even though the in-memory representation changes, the
semantic equality remains (we just make the values have no nulls and move all
the nulls to the indices).
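The "AND" of the two validities can be sketched in plain Python. This is only an illustration under stated assumptions: Python lists with None stand in for Arrow arrays, and the helper names (logical_values, combined_validity, normalize) are hypothetical, not part of pyarrow or of the Parquet writer.

```python
def logical_values(indices, dictionary):
    """Decode a dictionary-encoded array into the values a reader sees."""
    return [None if i is None else dictionary[i] for i in indices]


def combined_validity(indices, dictionary):
    """A slot is valid only if its index is valid AND the value it points to is valid."""
    return [i is not None and dictionary[i] is not None for i in indices]


def normalize(indices, dictionary):
    """Move all nulls to the indices so that the dictionary holds no nulls."""
    valid = combined_validity(indices, dictionary)
    remap = {}           # old dictionary position -> new dictionary position
    new_dictionary = []
    for old, value in enumerate(dictionary):
        if value is not None:
            remap[old] = len(new_dictionary)
            new_dictionary.append(value)
    new_indices = [remap[i] if ok else None for i, ok in zip(indices, valid)]
    return new_indices, new_dictionary


# The two arrays from the description, as (indices, dictionary) pairs.
array1 = ([0, 1, None], [None, "bar"])
array2 = ([None, 1, 2], ["aa", "bar", None])

# Different in-memory layouts, same logical content.
same_logical_content = logical_values(*array1) == logical_values(*array2)

# After normalization, neither dictionary contains a null.
normalized1 = normalize(*array1)
normalized2 = normalize(*array2)
```

Under this model both arrays decode to the same logical values, and normalization changes only the representation, which is the sense in which semantic equality is expected to hold after a round-trip.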
> [C++] [Python] Dictionary equality not correct?
> -----------------------------------------------
>
> Key: ARROW-13486
> URL: https://issues.apache.org/jira/browse/ARROW-13486
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 5.0.0
> Reporter: Jorge Leitão
> Priority: Major