[ https://issues.apache.org/jira/browse/ARROW-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218893#comment-17218893 ]

Joris Van den Bossche commented on ARROW-9801:
----------------------------------------------

[~jpivarski] sorry for the late reply here (and yes, the JIRA editor is not 
great ...)

Now, in the meantime, this is actually fixed (it was also reported on the 
mailing list, and fixed in ARROW-10237), and the fix should be included in 
pyarrow 2.0, which was just released.

Repeating your example with the latest pyarrow, I now get:

{code}
In [32]: pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")

In [33]: pa_loaded
Out[33]: 
pyarrow.Table
column: dictionary<values=string, indices=int32, ordered=0>

In [35]: pa_loaded.to_pydict()
Out[35]: 
{'column': ['one',
  'two',
  'three',
  'one',
  'two',
  'three',
  'one',
  'two',
  'three',
  'one',
  'two',
  'three']}

In [36]: pa_loaded["column"]
Out[36]: 
<pyarrow.lib.ChunkedArray object at 0x7fdbdb37e938>
[
  -- dictionary:
    [
      "one",
      "two",
      "three"
    ]
  -- indices:
    [
      0,
      1,
      2,
      0,
      1,
      2,
      0,
      1,
      2,
      0,
      1,
      2
    ]
]
{code}

So you can see that the dictionary is still "minimized" (only the unique 
values are preserved), but the indices are now correctly remapped to this new 
dictionary.
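
For reference, here is a minimal round-trip check along the same lines (a 
sketch, assuming pyarrow >= 2.0 and reusing the data and file name from your 
example):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Rebuild the original array whose dictionary contains duplicate values.
pa_array = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
    pa.array(["one", "two", "three", "one", "two", "three"]),
)
pa_table = pa.table({"column": pa_array})
pq.write_table(pa_table, "tmp2.parquet")

pa_loaded = pq.read_table("tmp2.parquet")

# The stored dictionary is deduplicated on write, but with the
# ARROW-10237 fix the logical values still round-trip unchanged.
assert pa_loaded.column("column").to_pylist() == pa_array.to_pylist()
{code}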

> DictionaryArray with non-unique values are silently corrupted when written to 
> a Parquet file
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9801
>                 URL: https://issues.apache.org/jira/browse/ARROW-9801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>         Environment: pyarrow 1.0.0 installed from conda-forge.
>            Reporter: Jim Pivarski
>            Priority: Major
>
> Suppose that you have a DictionaryArray with repeated values in the 
> dictionary:
> {code}
> >>> import pyarrow as pa
> >>> pa_array = pa.DictionaryArray.from_arrays(
> ...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
> ...     pa.array(["one", "two", "three", "one", "two", "three"])
> ... )
> >>> pa_array
> <pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>
> -- dictionary:
>   [
>     "one",
>     "two",
>     "three",
>     "one",
>     "two",
>     "three"
>   ]
> -- indices:
>   [
>     0,
>     1,
>     2,
>     3,
>     4,
>     5,
>     0,
>     1,
>     2,
>     3,
>     4,
>     5
>   ]
> {code}
> According to [the documentation|https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout],
> {quote}Dictionary encoding is a data representation technique to represent 
> values by integers referencing a *dictionary* usually consisting of unique 
> values.
> {quote}
> so a DictionaryArray like the one above is arguably invalid, but if so, then 
> I'd expect some error messages, rather than corrupt data, when I try to write 
> it to a Parquet file.
> {code}
> >>> pa_table = pa.Table.from_batches(
> ...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]
> ... )
> >>> pa_table
> pyarrow.Table
> column: dictionary<values=string, indices=int64, ordered=0>
> >>> import pyarrow.parquet
> >>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")
> {code}
> No errors so far. So we try to read it back and view it:
> {code}
> >>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")
> >>> pa_loaded
> pyarrow.Table
> column: dictionary<values=string, indices=int32, ordered=0>
> >>> pa_loaded.to_pydict()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
>   File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
>   File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
>   File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
>   File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long
> {code}
> Looking more closely at this, we see that the dictionary has been minimized 
> to include only unique values, but the indices haven't been correctly 
> translated:
> {code}
> >>> pa_loaded["column"]
> <pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>
> [
>   -- dictionary:
>     [
>       "one",
>       "two",
>       "three"
>     ]
>   -- indices:
>     [
>       0,
>       1,
>       2,
>       3,
>       0,
>       1,
>       1,
>       1,
>       2,
>       3,
>       0,
>       1
>     ]
> ]
> {code}
> It looks like an attempt was made to minimize the dictionary, but the indices 
> ought to be {{[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]}}.
> I don't know what your preferred course of action is—adding an error message 
> or fixing the attempted conversion—but this is wrong. On my side, I'm adding 
> code to prevent the creation of non-unique values in DictionaryArrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
