[
https://issues.apache.org/jira/browse/ARROW-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319513#comment-17319513
]
Joris Van den Bossche commented on ARROW-12336:
-----------------------------------------------
This is actually a bug in the pandas implementation: it shouldn't assume
anything about the length of the buffer (the buffer can be larger than needed
for the actual length of the array, even after taking the offset into
account).
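For intuition, here is a plain-Python analogy (not Arrow internals): a zero-copy slice keeps a view into the parent's full buffer, so the buffer behind a short or even empty array can be much larger than {{length * itemsize}}. The values below are illustrative, not taken from the failing chunk:

{code:python}
import struct

parent = struct.pack("<3q", 1, 2, 10)  # 24-byte buffer for [1, 2, 10]
view = memoryview(parent)              # zero-copy view, still 24 bytes

# A logically empty "slice" (offset=3, length=0) can still be backed by
# the full 24-byte parent buffer -- which is why a consumer must not
# infer the array length from the buffer size.
offset, length, itemsize = 3, 0, 8
needed = length * itemsize
print(view.nbytes, needed)  # 24 0
{code}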
I opened https://github.com/pandas-dev/pandas/issues/40896 for this.
[~ThomasBlauthQC] would you be interested in doing a PR for this? (in pandas)
Happy to provide some pointers to get started.
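As a starting point, one possible direction for the pandas fix (the exact patch is up for discussion in the pandas issue): {{pyarrow_array_to_numpy_and_mask}} currently reads the whole data buffer with {{np.frombuffer(buflist[1], dtype=dtype)}}, which raises when the buffer is over-allocated (e.g. 4 bytes backing an empty int64 array). Bounding the read with {{np.frombuffer}}'s {{count}} argument removes the assumption about the buffer's total size. The buffer below is a stand-in for Arrow's over-allocated data buffer, not the actual failing chunk:

{code:python}
import numpy as np

# Stand-in for an over-allocated Arrow data buffer: 4 bytes of padding,
# zero logical int64 elements (the situation from the traceback below).
buf = b"\x00" * 4
offset, length = 0, 0  # arr.offset and len(arr) for the empty chunk

# np.frombuffer(buf, dtype=np.int64) would raise:
#   ValueError: buffer size must be a multiple of element size
# Passing count bounds the read to exactly the elements we need:
data = np.frombuffer(buf, dtype=np.int64, count=offset + length)[offset:]
print(len(data))  # 0
{code}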
> [C++][Python] Empty Int64 array is of wrong size
> ------------------------------------------------
>
> Key: ARROW-12336
> URL: https://issues.apache.org/jira/browse/ARROW-12336
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Environment: macOS 10.15.7
> Arrow version: 3.1.0.dev578
> Reporter: Thomas Blauth
> Priority: Major
> Fix For: 4.0.0
>
>
> Setup:
> Table with Int64 and str columns, generated using the Dataset API and
> filtered on the str column.
>
> Bug Description:
> Calling {{table.to_pandas()}} fails due to an empty chunk in the ChunkedArray
> of the Int64 column. This empty array has a size of 4 bytes when using the
> Arrow nightly builds and 0 bytes when using Arrow 3.0.0.
> Note: The bug does not occur when the table only contains an Int64 column.
>
> Minimal example:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet
> import pyarrow.dataset
> print("Arrow version: " + str(pa.__version__))
> print("---------------")
> # Only Int64 works fine
> df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
> table = pa.table(df)
> path_0 = "./test_0.parquet"
> pa.parquet.write_table(table, path_0)
> schema = pa.parquet.read_schema(path_0)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_0],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> print("---------------")
> # Int64 and str crashes
> df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
> df = df.astype({"Int_col": "Int64"})
> table = pa.table(df)
> path_1 = "./test_1.parquet"
> pa.parquet.write_table(table, path_1)
> schema = pa.parquet.read_schema(path_1)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_1],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> {code}
>
> Output:
> {code:bash}
> Arrow version: 3.1.0.dev578
> ---------------
> Size of array: 0
> ---------------
> Size of array: 4
> Traceback (most recent call last):
>   File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
>     df = table.to_pandas()
>   File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
>     pd_ext_arr = pandas_dtype.__from_arrow__(arr)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
>     data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
>     data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
> ValueError: buffer size must be a multiple of element size
> {code}