[ https://issues.apache.org/jira/browse/ARROW-13413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385708#comment-17385708 ]
Weston Pace commented on ARROW-13413:
-------------------------------------
The key difference appears to be that tbl1 (the table created by converting
from pandas) is a table whose column has one chunk containing zero values,
while tbl2 (the table created by reading the Arrow file back) has a column
with zero chunks.
{code:python}
>>> len(tbl1.column('x').chunks)
1
>>> len(tbl2.column('x').chunks)
0
>>> len(tbl1.column('x').chunks[0])
0
{code}
I think that both are valid representations of an empty table. I'm not sure it
makes sense for Arrow to define one of these as more correct than the other. I
believe the correct fix would be for Pandas to be able to process either form.
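For illustration, both shapes can be constructed directly in pyarrow (a minimal sketch; note that `pa.chunked_array` needs an explicit type when the chunk list is empty):
{code:python}
import pyarrow as pa

# One chunk containing zero values -- the shape Table.from_pandas produces
one_empty_chunk = pa.chunked_array([pa.array([], type=pa.int8())])

# Zero chunks -- the shape the IPC stream reader produces
zero_chunks = pa.chunked_array([], type=pa.int8())

assert len(one_empty_chunk) == 0 and len(zero_chunks) == 0
assert one_empty_chunk.num_chunks == 1
assert zero_chunks.num_chunks == 0
{code}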
Deleting the metadata presumably helps because pandas then no longer knows
what the extension data type is, and the bug appears to be specific to the
Int8 array conversion:
{code:java}
116         for arr in chunks:
117             data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
118             int_arr = IntegerArray(data.copy(), ~mask, copy=False)
119             results.append(int_arr)
120
121  ->     return IntegerArray._concat_same_type(results)
{code}
Here you can see that `results` will hold a single empty `IntegerArray` for `tbl1` but will be an empty list (`[]`) for `tbl2`; calling `np.concatenate` on that empty list is what raises the `ValueError` in the traceback below.
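A minimal sketch of what a pandas-side guard could look like (a hypothetical standalone helper mirroring the tail of `__from_arrow__` above, not the actual pandas patch):
{code:python}
import numpy as np
from pandas.core.arrays import IntegerArray

def concat_integer_chunks(results, numpy_dtype):
    # Hypothetical guard: with zero chunks there is nothing to
    # concatenate, so return an empty IntegerArray directly instead
    # of letting np.concatenate raise on an empty list.
    if not results:
        return IntegerArray(np.array([], dtype=numpy_dtype),
                            np.array([], dtype=bool))
    return IntegerArray._concat_same_type(results)
{code}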
> IPC roundtrip fails in to_pandas with empty table and extension type
> ---------------------------------------------------------------------
>
> Key: ARROW-13413
> URL: https://issues.apache.org/jira/browse/ARROW-13413
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1
> Reporter: Thomas Buhrmann
> Priority: Major
>
> With pyarrow=4.0.1 and pandas=1.2.3, when an empty DataFrame with an
> extension dtype is written to and then read back from an IPC stream,
> `to_pandas` subsequently fails to convert the Arrow table that was read:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df1 = pd.DataFrame({"x": pd.Series([], dtype="Int8")})
> tbl1 = pa.Table.from_pandas(df1)
> # In memory roundtrip seems to work fine
> pa.Table.from_pandas(tbl1.to_pandas()).to_pandas()
> path = "/tmp/tmp.arr"
> writer = pa.RecordBatchStreamWriter(path, tbl1.schema)
> writer.write_table(tbl1)
> writer.close()
> reader = pa.RecordBatchStreamReader(path)
> tbl2 = reader.read_all()
> assert tbl1.schema.equals(tbl2.schema)
> assert tbl1.schema.metadata == tbl2.schema.metadata
> # Converting the original table still works
> df2 = tbl1.to_pandas()
> try:
>     df2 = tbl2.to_pandas()
> except Exception as e:
>     print(f"Error: {e}")
> # Workaround: dropping the schema metadata makes the conversion succeed
> df2 = tbl2.replace_schema_metadata(None).to_pandas()
> {code}
> In the above example (with `Int8` as the pandas dtype), the table read from
> disk cannot be converted to a DataFrame, even though its schema and metadata
> are supposedly equal to the original table. Removing its metadata "fixes"
> the issue.
> The problem doesn't occur with "normal" dtypes. This may well be a bug in
> Pandas, but it seems to depend on some change in Arrow's metadata.
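> As an aside (a hedged workaround sketch, not part of the original report; the
> helper name is hypothetical): normalizing zero-chunk columns to a single empty
> chunk before calling `to_pandas` also avoids the failing code path, without
> discarding the metadata.
> {code:python}
> import pyarrow as pa
>
> def ensure_one_chunk(tbl: pa.Table) -> pa.Table:
>     # Hypothetical helper: give every zero-chunk column one empty
>     # chunk, keeping the schema (and its pandas metadata) intact.
>     columns = [
>         pa.chunked_array([pa.array([], type=col.type)])
>         if col.num_chunks == 0 else col
>         for col in tbl.columns
>     ]
>     return pa.Table.from_arrays(columns, schema=tbl.schema)
>
> # df2 = ensure_one_chunk(tbl2).to_pandas()
> {code}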
> The full stacktrace:
> {code:java}
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-3-08855adb276d> in <module>
> ----> 1 df2 = tbl2.to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
>     787     _check_data_column_metadata_consistency(all_columns)
>     788     columns = _deserialize_column_index(table, all_columns, column_indexes)
> --> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>     790
>     791     axes = [columns, index]
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
>    1128     result = pa.lib.table_to_blocks(options, block_table, categories,
>    1129                                     list(extension_columns.keys()))
> -> 1130     return [_reconstruct_block(item, columns, extension_columns)
>    1131             for item in result]
>    1132
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
>    1128     result = pa.lib.table_to_blocks(options, block_table, categories,
>    1129                                     list(extension_columns.keys()))
> -> 1130     return [_reconstruct_block(item, columns, extension_columns)
>    1131             for item in result]
>    1132
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
>     747             raise ValueError("This column does not support to be converted "
>     748                              "to a pandas ExtensionArray")
> --> 749         pd_ext_arr = pandas_dtype.__from_arrow__(arr)
>     750         block = _int.make_block(pd_ext_arr, placement=placement)
>     751     else:
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __from_arrow__(self, array)
>     119             results.append(int_arr)
>     120
> --> 121         return IntegerArray._concat_same_type(results)
>     122
>     123
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/masked.py in _concat_same_type(cls, to_concat)
>     269         cls: Type[BaseMaskedArrayT], to_concat: Sequence[BaseMaskedArrayT]
>     270     ) -> BaseMaskedArrayT:
> --> 271         data = np.concatenate([x._data for x in to_concat])
>     272         mask = np.concatenate([x._mask for x in to_concat])
>     273         return cls(data, mask)
>
> <__array_function__ internals> in concatenate(*args, **kwargs)
>
> ValueError: need at least one array to concatenate
> {code}