Nick Radcliffe created ARROW-9880:
-------------------------------------

Summary: Lose access to indices & dictionary roundtripping DictionaryArray to parquet file
Key: ARROW-9880
URL: https://issues.apache.org/jira/browse/ARROW-9880
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.1
Environment: Mac running macOS Catalina (10.15.2), Python 3.7.6.
Reporter: Nick Radcliffe
Attachments: pyarraw_dictionaryarray_bug.py

I am in the process of adding support for reading/writing Parquet to a data analysis tool (Miró: https://stochasticsolutions.com/miro/). The tool has a string column type that is extremely close to PyArrow's DictionaryArray, so it was natural to add support for that, but round-tripping doesn't seem to work, as this example shows.

The code creates a table with a single column, a dictionary array, and writes it to a Parquet file using `write_table`. On reading it back in, the column's `.type` indicates that it's a dictionary type, but Python reports the object's type as `ChunkedArray`. Either way, it doesn't seem to have `indices` or `dictionary` properties. `to_pylist` works, so I can get the data in, but almost all the benefit of writing a dictionary array is lost if I need to convert it to a Python list to access its values.

I presume it isn't supposed to be like this.

{code:python}
$ python3
Python 3.7.6 (v3.7.6:43364a7ae0, Dec 18 2019, 14:18:50)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> print('PyArrow version:', pa.__version__)
PyArrow version: 1.0.1
>>>
>>>
>>> dictionary = ['zero', 'one', 'two']
>>> indices = [None, 0, 1, 2, 0, 1, 0]
>>>
>>> col = pa.DictionaryArray.from_arrays(indices, dictionary)
>>> print('col:', col)
col: -- dictionary: [ "zero", "one", "two" ] -- indices: [ null, 0, 1, 2, 0, 1, 0 ]
>>> print('col.to_pylist():', col.to_pylist())
col.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('col.type:', col.type)
col.type: dictionary<values=string, indices=int64, ordered=0>
>>> print('type(col):', type(col))
type(col): <class 'pyarrow.lib.DictionaryArray'>
>>> print('col.indices:', col.indices)
col.indices: [ null, 0, 1, 2, 0, 1, 0 ]
>>> print('col.dictionary:', col.dictionary)
col.dictionary: [ "zero", "one", "two" ]
>>>
>>> path = '/tmp/zot.parquet'
>>> pq.write_table(pa.lib.Table.from_pydict({'zot': col}), path)
>>> table = pq.read_table(path)
>>>
>>> zot = table['zot']
>>> print('zot:', zot)
zot: [ -- dictionary: [ "zero", "one", "two" ] -- indices: [ null, 0, 1, 2, 0, 1, 0 ] ]
>>> print('zot.to_pylist():', zot.to_pylist())
zot.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('zot.type:', zot.type)
zot.type: dictionary<values=string, indices=int32, ordered=0>
>>> print('type(zot):', type(zot))
type(zot): <class 'pyarrow.lib.ChunkedArray'>
>>> print('zot.indices:', zot.indices)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'indices'
>>> print('zot.dictionary:', zot.dictionary)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'dictionary'
>>> ^D
{code}