Athanassios Hatzis created ARROW-9505:
-----------------------------------------
Summary: [Python] pa.struct() dictionary-encode not implemented for decimal
Key: ARROW-9505
URL: https://issues.apache.org/jira/browse/ARROW-9505
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.17.1
Reporter: Athanassios Hatzis
Hi, given this PyArrow struct array:
{code:java}
struct_array.slice(0,3)
Out[52]:
<pyarrow.lib.StructArray object at 0x7f92061e9dc0>
-- is_valid: all not null
-- child 0 type: int16
[
991,
992,
993
]
-- child 1 type: decimal(6, 3)
[
36.100,
42.300,
15.300
]
{code}
I tried to apply the dictionary_encode() method and got this error:
{code:java}
struct_array.dictionary_encode()
File "<ipython-input-51-440741990dd7>", line 1, in <module>
struct_array.dictionary_encode()
File "pyarrow/array.pxi", line 750, in pyarrow.lib.Array.dictionary_encode
File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: dictionary-encode not implemented for
struct<catpid: int16, catcost: decimal(6, 3)>
{code}
I know it is possible to apply dictionary_encode() to each field of the
struct_array and then build a RecordBatch from the dictionary-encoded
fields, so I am not sure why this is not implemented for the struct array itself.
I also noticed the RecordBatch.from_struct_array() conversion, but I want the
columns to be dictionary encoded, and the only way to do that in the current
version is to process each field (column) separately.
BTW: in my project I am addressing a basic problem: how to transform
tuples from any database table into dictionary-encoded columns of a PyArrow
RecordBatch (Table).