[ https://issues.apache.org/jira/browse/ARROW-13413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385708#comment-17385708 ]
Weston Pace commented on ARROW-13413:
-------------------------------------
The key difference appears to be that tbl1 (the table created by converting
from pandas) is a table whose column has one chunk containing zero values,
while tbl2 (the table created by reading the Arrow file back) has a column
with zero chunks.
{code:python}
>>> len(tbl1.column('x').chunks)
1
>>> len(tbl2.column('x').chunks)
0
>>> len(tbl1.column('x').chunks[0])
0
{code}
I think that both are valid representations of an empty table. I'm not sure it
makes sense for Arrow to define one of these as more correct than the other. I
believe the correct fix would be for Pandas to be able to process either form.
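For illustration, both shapes can be constructed directly in pyarrow (a minimal sketch; note that `pa.chunked_array` needs an explicit type when the chunk list is empty):
{code:python}
import pyarrow as pa

# One chunk containing zero values -- the shape Table.from_pandas produces
one_empty_chunk = pa.chunked_array([pa.array([], type=pa.int8())])

# Zero chunks -- the shape the IPC stream reader produces
zero_chunks = pa.chunked_array([], type=pa.int8())

assert len(one_empty_chunk) == 0 and len(zero_chunks) == 0
assert one_empty_chunk.num_chunks == 1
assert zero_chunks.num_chunks == 0
{code}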
Deleting the metadata presumably helps because pandas then no longer knows
what the extension data type is, and the bug appears to be specific to the
Int8 array conversion:
{code:java}
116         for arr in chunks:
117             data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
118             int_arr = IntegerArray(data.copy(), ~mask, copy=False)
119             results.append(int_arr)
120
121  ->     return IntegerArray._concat_same_type(results)
{code}
Here you can see that `results` will hold a single empty `IntegerArray` for `tbl1` but will be an empty list (`[]`) for `tbl2`; calling `np.concatenate` on that empty list is what raises the `ValueError` in the traceback below.
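A minimal sketch of what a pandas-side guard could look like (a hypothetical standalone helper mirroring the tail of `__from_arrow__` above, not the actual pandas patch):
{code:python}
import numpy as np
from pandas.core.arrays import IntegerArray

def concat_integer_chunks(results, numpy_dtype):
    # Hypothetical guard: with zero chunks there is nothing to
    # concatenate, so return an empty IntegerArray directly instead
    # of letting np.concatenate raise on an empty list.
    if not results:
        return IntegerArray(np.array([], dtype=numpy_dtype),
                            np.array([], dtype=bool))
    return IntegerArray._concat_same_type(results)
{code}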
> IPC roundtrip fails in to_pandas with empty table and extension type
> ---------------------------------------------------------------------
>
> Key: ARROW-13413
> URL: https://issues.apache.org/jira/browse/ARROW-13413
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1
> Reporter: Thomas Buhrmann
> Priority: Major
>
> With pyarrow=4.0.1 and pandas=1.2.3, when an empty DataFrame with an
> extension dtype is written to and then read back from an IPC stream,
> `to_pandas` subsequently fails to convert the Arrow table that was read:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df1 = pd.DataFrame({"x": pd.Series([], dtype="Int8")})
> tbl1 = pa.Table.from_pandas(df1)
> # In memory roundtrip seems to work fine
> pa.Table.from_pandas(tbl1.to_pandas()).to_pandas()
> path = "/tmp/tmp.arr"
> writer = pa.RecordBatchStreamWriter(path, tbl1.schema)
> writer.write_table(tbl1)
> writer.close()
> reader = pa.RecordBatchStreamReader(path)
> tbl2 = reader.read_all()
> assert tbl1.schema.equals(tbl2.schema)
> assert tbl1.schema.metadata == tbl2.schema.metadata
> # Converting the original table still works
> df2 = tbl1.to_pandas()
> try:
>     df2 = tbl2.to_pandas()
> except Exception as e:
>     print(f"Error: {e}")
> # Workaround: dropping the schema metadata makes the conversion succeed
> df2 = tbl2.replace_schema_metadata(None).to_pandas()
> {code}
> In the above example (with `Int8` as the pandas dtype), the table read from
> disk cannot be converted to a DataFrame, even though its schema and metadata
> are supposedly equal to the original table. Removing its metadata "fixes"
> the issue.
> The problem doesn't occur with "normal" dtypes. This may well be a bug in
> Pandas, but it seems to depend on some change in Arrow's metadata.
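> As an aside (a hedged workaround sketch, not part of the original report; the
> helper name is hypothetical): normalizing zero-chunk columns to a single empty
> chunk before calling `to_pandas` also avoids the failing code path, without
> discarding the metadata.
> {code:python}
> import pyarrow as pa
>
> def ensure_one_chunk(tbl: pa.Table) -> pa.Table:
>     # Hypothetical helper: give every zero-chunk column one empty
>     # chunk, keeping the schema (and its pandas metadata) intact.
>     columns = [
>         pa.chunked_array([pa.array([], type=col.type)])
>         if col.num_chunks == 0 else col
>         for col in tbl.columns
>     ]
>     return pa.Table.from_arrays(columns, schema=tbl.schema)
>
> # df2 = ensure_one_chunk(tbl2).to_pandas()
> {code}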
> The full stacktrace:
> {code:java}
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-3-08855adb276d> in <module>
> ----> 1 df2 = tbl2.to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
>     787     _check_data_column_metadata_consistency(all_columns)
>     788     columns = _deserialize_column_index(table, all_columns, column_indexes)
> --> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>     790
>     791     axes = [columns, index]
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
>    1128     result = pa.lib.table_to_blocks(options, block_table, categories,
>    1129                                     list(extension_columns.keys()))
> -> 1130     return [_reconstruct_block(item, columns, extension_columns)
>    1131             for item in result]
>    1132
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
>    1128     result = pa.lib.table_to_blocks(options, block_table, categories,
>    1129                                     list(extension_columns.keys()))
> -> 1130     return [_reconstruct_block(item, columns, extension_columns)
>    1131             for item in result]
>    1132
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
>     747             raise ValueError("This column does not support to be converted "
>     748                              "to a pandas ExtensionArray")
> --> 749         pd_ext_arr = pandas_dtype.__from_arrow__(arr)
>     750         block = _int.make_block(pd_ext_arr, placement=placement)
>     751     else:
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __from_arrow__(self, array)
>     119             results.append(int_arr)
>     120
> --> 121         return IntegerArray._concat_same_type(results)
>     122
>     123
>
> ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/masked.py in _concat_same_type(cls, to_concat)
>     269         cls: Type[BaseMaskedArrayT], to_concat: Sequence[BaseMaskedArrayT]
>     270     ) -> BaseMaskedArrayT:
> --> 271         data = np.concatenate([x._data for x in to_concat])
>     272         mask = np.concatenate([x._mask for x in to_concat])
>     273         return cls(data, mask)
>
> <__array_function__ internals> in concatenate(*args, **kwargs)
>
> ValueError: need at least one array to concatenate
> {code}