adamreeve opened a new issue, #41562:
URL: https://github.com/apache/arrow/issues/41562
### Describe the bug, including details regarding any error messages, version, and platform.
Write byte-stream-split encoded floats containing null values:
```python
import pyarrow as pa
import pyarrow.parquet as pq

num_rows = 10_230
xs = pa.array(
    [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)],
    type=pa.float32())
table = pa.Table.from_arrays([xs], names=['x'])
pq.write_table(
    table, 'data.parquet',
    use_byte_stream_split=True,
    use_dictionary=False)
```
And then attempt to read the data back:
```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table('data.parquet')
xs = table['x']
num_rows = 10_230
assert len(xs) == num_rows
for i in range(num_rows):
    value = xs[i]
    if i % 10 == 5:
        assert not value.is_valid
    else:
        assert value.is_valid
        assert value.equals(pa.scalar(i / 3.14, type=pa.float32()))
```
The above code works with pyarrow 15.0.2 but fails with pyarrow 16.0.0 with
the following exception:
```
Traceback (most recent call last):
  File "/home/adam/dev/parquet-issues/null-byte-stream-split-regression/read_data.py", line 3, in <module>
    table = pq.read_table('data.parquet')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Data size (36828) does not match number of values in BYTE_STREAM_SPLIT (10230)
```
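The two numbers in the error message can be checked against the reproduction script. The short calculation below suggests that the reported data size equals the number of non-null values times 4 bytes, while the reported number of values is the full row count including nulls. This is only an inference from the arithmetic; it has not been checked against the Arrow C++ source.

```python
# Parameters taken from the reproduction script above.
num_rows = 10_230
bytes_per_float32 = 4

# The script writes a null wherever i % 10 == 5.
null_count = sum(1 for i in range(num_rows) if i % 10 == 5)
non_null_count = num_rows - null_count

# Size of the value data if only non-null values are stored,
# versus the size if every row (nulls included) took 4 bytes.
data_size_non_null = non_null_count * bytes_per_float32
data_size_all_rows = num_rows * bytes_per_float32

print(null_count)          # 1023
print(non_null_count)      # 9207
print(data_size_non_null)  # 36828, the "Data size" in the error
print(data_size_all_rows)  # 40920
```

If this reading is right, the size check would pass when compared against 9207 values and fail against 10230, which matches the observation that files without nulls read correctly.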
Writing the data with pyarrow 15.0.2 and reading with pyarrow 16.0.0 also
fails, but writing with 16.0.0 and reading with 15.0.2 works fine. Disabling
byte stream split encoding or not writing any nulls also makes the error go
away.
This looks related to #28737, although the error there was quite different.
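For readers unfamiliar with the encoding, here is a minimal pure-Python sketch of BYTE_STREAM_SPLIT for float32, following the description in the Parquet format specification: byte k of every value goes into stream k, and the streams are concatenated. It assumes that nulls are carried by definition levels and are absent from the value data, as in Parquet generally. The function names and the `ValueError` wording are hypothetical and only modeled on the error above; this is not pyarrow's implementation.

```python
import struct

WIDTH = 4  # bytes per float32


def byte_stream_split_encode(values):
    """Encode the non-null values; nulls are skipped entirely."""
    raw = [struct.pack('<f', v) for v in values if v is not None]
    # Stream k holds byte k of every value; streams are concatenated.
    return b''.join(bytes(r[k] for r in raw) for k in range(WIDTH))


def byte_stream_split_decode(data, num_values):
    """Decode num_values float32 values, checking the size first."""
    if len(data) != num_values * WIDTH:
        raise ValueError(
            f"Data size ({len(data)}) does not match "
            f"number of values ({num_values})")
    return [
        struct.unpack(
            '<f',
            bytes(data[k * num_values + i] for k in range(WIDTH)))[0]
        for i in range(num_values)
    ]


# Same null pattern as the report, on a smaller array.
values = [None if i % 10 == 5 else i / 3.14 for i in range(20)]
non_null = [v for v in values if v is not None]
encoded = byte_stream_split_encode(values)

# Decoding with the non-null count round-trips the data.
decoded = byte_stream_split_decode(encoded, len(non_null))

# Decoding with the total row count trips the size check.
try:
    byte_stream_split_decode(encoded, len(values))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In this sketch the size check only succeeds when the decoder is given the non-null count, which is one plausible way a reader could produce the mismatch reported above.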
### Component(s)
C++, Parquet