Thomas Blauth created ARROW-12336:
-------------------------------------
Summary: [C++][Python] Empty Int64 array is of wrong size
Key: ARROW-12336
URL: https://issues.apache.org/jira/browse/ARROW-12336
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Environment: macOS 10.15.7
Arrow version: 3.1.0.dev578
Reporter: Thomas Blauth
Setup:
A table with an Int64 column and a str column, generated using the dataset API
and filtered on the str column.
Bug Description:
Calling {{table.to_pandas()}} fails due to an empty array in the ChunkedArray
of the Int64 column. This empty array has a size of 4 bytes when using the
Arrow nightly builds, but 0 bytes when using Arrow 3.0.0.
Note: The bug does not occur when the table contains only an Int64 column.
Minimal example:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs
import pyarrow.parquet
print("Arrow version: " + str(pa.__version__))
print("---------------")
# Only Int64 works fine
df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
table = pa.table(df)
path_0 = "./test_0.parquet"
pa.parquet.write_table(table, path_0)
schema = pa.parquet.read_schema(path_0)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_0],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))
print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
print("---------------")
# Int64 and str crashes
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)
path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)
schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))
print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
{code}
Output:
{code:bash}
Arrow version: 3.1.0.dev578
---------------
Size of array: 0
---------------
Size of array: 4
Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
ValueError: buffer size must be a multiple of element size
{code}
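The ValueError at the bottom of the traceback comes from pandas calling
{{np.frombuffer}} on the 4-byte data buffer with the 8-byte int64 dtype. The
same error can be reproduced with plain NumPy (a standalone illustration, not
using the affected Arrow buffers):

```python
import numpy as np

# np.frombuffer requires the buffer length to be a multiple of the
# element size; a 4-byte buffer cannot hold a whole 8-byte int64.
buf = bytes(4)
try:
    np.frombuffer(buf, dtype=np.int64)
except ValueError as exc:
    print(exc)  # buffer size must be a multiple of element size
```
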
--
This message was sent by Atlassian Jira
(v8.3.4#803005)