Pac A. He created ARROW-11456:
---------------------------------
Summary: OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements
Key: ARROW-11456
URL: https://issues.apache.org/jira/browse/ARROW-11456
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0, 2.0.0
Environment: pyarrow 3.0.0 / 2.0.0
pandas 1.2.1
Reporter: Pac A. He

When reading a large Parquet file, I get this error:
{noformat}
    df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large Parquet files? It let me write this
file, but now it doesn't let me read it back. I don't understand why Arrow uses
[32-bit computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths]
in a 64-bit world.
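
For reference, the two numbers in the error message sit on either side of the signed 32-bit boundary. The sketch below only checks that arithmetic with the Python standard library. It assumes, without confirmation from this report, that the limit comes from 32-bit offsets in Arrow's default binary/string layout; the names `INT32_MAX`, `REPORTED_LIMIT`, `REQUESTED`, and `fits_in_int32` are illustrative and are not part of pyarrow.

```python
import struct

INT32_MAX = 2**31 - 1        # largest signed 32-bit integer (2147483647)
REPORTED_LIMIT = 2147483646  # limit quoted in the error message
REQUESTED = 2147483648       # "got" value quoted in the error message


def fits_in_int32(value: int) -> bool:
    """Return True if value can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


# The reported limit is one below INT32_MAX; the requested size is one above it.
print(REPORTED_LIMIT == INT32_MAX - 1)
print(REQUESTED == INT32_MAX + 1)
print(fits_in_int32(REPORTED_LIMIT), fits_in_int32(REQUESTED))
```

If that assumption holds, the requested size of exactly 2**31 is the first value that a signed 32-bit offset cannot represent, which would explain why the read fails at this point.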