adamreeve opened a new issue, #41562:
URL: https://github.com/apache/arrow/issues/41562

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Write byte-stream-split encoded floats containing null values:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   num_rows = 10_230
   xs = pa.array(
           [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)],
           type=pa.float32())
   
   table = pa.Table.from_arrays([xs], names=['x'])
   pq.write_table(
           table, 'data.parquet',
           use_byte_stream_split=True,
           use_dictionary=False)
   ```
   
   And then attempt to read the data back:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pq.read_table('data.parquet')
   xs = table['x']
   
   num_rows = 10_230
   assert len(xs) == num_rows
   for i in range(num_rows):
       value = xs[i]
       if i % 10 == 5:
           assert not value.is_valid
       else:
           assert value.is_valid
           assert value.equals(pa.scalar(i / 3.14, type=pa.float32()))
   ```
   
   The above code works with pyarrow 15.0.2 but fails with pyarrow 16.0.0 with the following exception:
   ```
   Traceback (most recent call last):
     File "/home/adam/dev/parquet-issues/null-byte-stream-split-regression/read_data.py", line 3, in <module>
       table = pq.read_table('data.parquet')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1454, in read
       table = self._dataset.to_table(
               ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: Data size (36828) does not match number of values in BYTE_STREAM_SPLIT (10230)
   ```
   
   Writing the data with pyarrow 15.0.2 and reading with pyarrow 16.0.0 also fails, but writing with 16.0.0 and reading with 15.0.2 works fine, which suggests the regression is on the read side. Disabling byte stream split encoding or not writing any nulls also makes the error go away.
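   
   The two numbers in the error message can be checked by hand. This is a minimal sketch, assuming 4 bytes per float32 value and that null values occupy no space in the encoded data page. How to read the error is an inference from these numbers, not something confirmed against the Arrow source:
   ```python
   # Arithmetic check of the numbers in the error message (no pyarrow needed).
   num_rows = 10_230
   float32_size = 4  # bytes per float32 value (assumption stated above)
   
   # Same null pattern as the reproduction script: every row with i % 10 == 5.
   num_nulls = sum(1 for i in range(num_rows) if i % 10 == 5)
   num_non_null = num_rows - num_nulls
   
   # Size of the encoded data if only non-null values are stored,
   # versus the size if every row (nulls included) were stored.
   data_size_without_nulls = num_non_null * float32_size
   data_size_with_nulls = num_rows * float32_size
   
   print(num_nulls, num_non_null)   # 1023 9207
   print(data_size_without_nulls)   # 36828
   print(data_size_with_nulls)      # 40920
   ```
   
   The reported data size of 36828 bytes equals the non-null count (9207) times 4, while the reported value count of 10230 is the total row count including nulls. The check may therefore be comparing the data size against a count that includes nulls.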
   
   This looks related to #28737, although the error there was quite different.
   
   ### Component(s)
   
   C++, Parquet

