[ 
https://issues.apache.org/jira/browse/ARROW-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524230#comment-17524230
 ] 

Joris Van den Bossche commented on ARROW-16231:
-----------------------------------------------

If I try to recreate this with a pure-pyarrow example, I get a different error:

 

{code}
import pyarrow as pa
from pyarrow.tests.test_extension_type import MyStructType

struct_array = pa.StructArray.from_arrays(
    [pa.array([0, 1], type="int64"), pa.array([1, 2], type="int64")],
    names=["left", "right"])
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
dict_array = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), mystruct_array)

# roundtrip through Feather
from pyarrow import feather
feather.write_feather(pa.table({'a': dict_array}), "test_dict_ext_nested.feather")
feather.read_table("test_dict_ext_nested.feather")
{code}

gives

{code}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-df8b416670f4> in <module>
----> 1 feather.read_table("test_dict_ext_nested.feather")

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
    242     table : pyarrow.Table
    243     """
--> 244     reader = _feather.FeatherReader(
    245         source, use_memory_map=memory_map, use_threads=use_threads)
    246 

~/scipy/repos/arrow/python/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.__cinit__()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/scipy/repos/arrow/python/pyarrow/types.pxi in pyarrow.lib.PyExtensionType.__arrow_ext_deserialize__()
TypeError: Expected storage type struct<left: int64, right: int64> but got dictionary<values=struct<left: int64, right: int64>, indices=int64, ordered=0>
{code}


> [C++][Python] IPC failure for dictionary with extension type with struct storage type
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-16231
>                 URL: https://issues.apache.org/jira/browse/ARROW-16231
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Report from [https://github.com/apache/arrow/issues/12899]
> Roundtripping through IPC/Feather fails for a dictionary type whose
> dictionary values are an extension type with a nested storage type. Writing
> seems to work (although I don't know whether the written file is "correct",
> since trying to read the schema gives an error), but reading it back fails
> with _"ArrowInvalid: Ran out of field metadata, likely malformed"_.
> The original use case came from a pandas extension type: the pandas interval
> dtype is mapped to an Arrow extension type with a struct type as storage, and
> in this case the interval type was further wrapped in a categorical
> (dictionary) type. A pandas-based test that reproduces this (it can be added
> as-is to {{test_feather.py}}):
> {code:python}
> @pytest.mark.pandas
> def test_dictionary_interval():
>     df = pd.DataFrame({'a': pd.cut(range(1, 10, 3), [-1, 5, 10])})
>     _check_pandas_roundtrip(df, version=2)
> {code}
> this gives:
> {code:java}
> $ pytest python/pyarrow/tests/test_feather.py::test_dictionary_interval
> ....
> ========================= FAILURES =================
> ____________ test_dictionary_interval _______________
> pyarrow/_feather.pyx:88: in pyarrow._feather.FeatherReader.read
> E   pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
> E   ../src/arrow/ipc/reader.cc:266  GetFieldMetadata(field_index_++, out_)
> E   ../src/arrow/ipc/reader.cc:283  LoadCommon(type_id)
> E   ../src/arrow/ipc/reader.cc:324  Load(child_fields[i].get(), parent->child_data[i].get())
> E   ../src/arrow/ipc/reader.cc:529  loader.Load(&field, column.get())
> E   ../src/arrow/ipc/reader.cc:1188  ReadRecordBatchInternal(*message->metadata(), schema_, field_inclusion_mask_, context, reader.get())
> E   ../src/arrow/ipc/feather.cc:730  reader->ReadRecordBatch(i)
> pyarrow/error.pxi:100: ArrowInvalid
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
