adamreeve opened a new issue, #35423:
URL: https://github.com/apache/arrow/issues/35423
### Describe the bug, including details regarding any error messages, version, and platform.
Arrow 12.0.0 has a regression where it can crash when reading
byte-stream-split encoded data, whether that data was written by 12.0.0
itself or by an older version of Arrow:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
table = pa.Table.from_arrays([x], names=['x'])
pq.write_table(table, 'data.parquet', use_dictionary=False,
               use_byte_stream_split=True)
table = pq.read_table('data.parquet')
print(table)
```
This crashes with:
```
Traceback (most recent call last):
  File "/home/.../write_read_data.py", line 9, in <module>
    table = pq.read_table('data.parquet')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)
```
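For context on the error message: byte-stream-split encoding stores byte 0 of every value, then byte 1 of every value, and so on, so an encoded page of N values of width K should occupy exactly N * K bytes. The following is a minimal pure-Python sketch of that layout and of a size check similar in spirit to the one the message describes. It is not Arrow's implementation; the function names `bss_encode` and `bss_decode` are hypothetical and exist only for illustration.

```python
import struct


def bss_encode(values, fmt="f"):
    """Illustrative byte-stream-split encoder (hypothetical helper, not Arrow's code)."""
    width = struct.calcsize("<" + fmt)  # bytes per value, e.g. 4 for float32
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    # Stream k holds byte k of every value; the streams are concatenated.
    return b"".join(raw[k::width] for k in range(width))


def bss_decode(data, num_values, fmt="f"):
    """Illustrative decoder that rejects data whose size does not match num_values."""
    width = struct.calcsize("<" + fmt)
    expected = num_values * width
    if len(data) > expected:
        # Assumption: mirrors the idea behind the reported error, not Arrow's exact logic.
        raise ValueError("data size too large for number of values")
    if len(data) < expected:
        raise ValueError("data size too small for number of values")
    raw = bytearray(expected)
    for k in range(width):
        # Scatter stream k back into byte position k of each value.
        raw[k::width] = data[k * num_values:(k + 1) * num_values]
    return list(struct.unpack(f"<{num_values}{fmt}", bytes(raw)))


encoded = bss_encode([0.0, 0.5, 1.0])
decoded = bss_decode(encoded, 3)
```

In this sketch, passing a buffer that is even one byte longer than `num_values * width` raises the "too large" error, which is the kind of mismatch the message above reports.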
The above code works fine with pyarrow 11.0.0 and 10.0.1.
It appears that #34140 caused this regression. I built pyarrow from the
current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and
could reproduce the error, but the error went away after I reverted the merge
of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
### Component(s)
Parquet