[GitHub] [arrow] adamreeve opened a new issue, #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

via GitHub Wed, 03 May 2023 19:11:09 -0700


adamreeve opened a new issue, #35423:
URL: https://github.com/apache/arrow/issues/35423


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Arrow 12.0.0 has a regression where reading byte-stream-split encoded data can fail with an error, whether the data was written by 12.0.0 itself or by older versions of Arrow:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
   table = pa.Table.from_arrays([x], names=['x'])
   pq.write_table(table, 'data.parquet', use_dictionary=False,
                  use_byte_stream_split=True)
   
   table = pq.read_table('data.parquet')
   print(table)
   ```
   Reading the file back raises:
   ```
   Traceback (most recent call last):
     File "/home/.../write_read_data.py", line 9, in <module>
        table = pq.read_table('data.parquet')
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
        return dataset.read(columns=columns, use_threads=use_threads,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
        table = self._dataset.to_table(
                        ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```
   But the above code works fine with pyarrow 11.0.0 and 10.0.1.
   
   It appears that #34140 caused this regression. I tested building pyarrow on 
the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and 
could reproduce the error, but it was fixed after I reverted the merge of that 
PR (commit c31fb46544b9c8372e799138bad9223162169473).
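
For context, the size relationship that the error message refers to can be illustrated with a minimal NumPy sketch of the byte-stream-split layout as the Parquet format describes it: byte k of every value goes into stream k, and the streams are concatenated, so N values of width K should occupy exactly N * K bytes. This is not Arrow's decoder. The function names are hypothetical, and the strict size check is only an assumption based on the wording of the error above.

```python
import numpy as np


def byte_stream_split_encode(values: np.ndarray) -> bytes:
    """Hypothetical helper: split each value's bytes into per-byte streams."""
    width = values.dtype.itemsize
    # One row per value, one column per byte position within the value.
    raw = np.ascontiguousarray(values).view(np.uint8).reshape(-1, width)
    # Transposing puts byte k of every value into row k (stream k);
    # tobytes() then concatenates the streams in order.
    return raw.T.tobytes()


def byte_stream_split_decode(data: bytes, dtype, num_values: int) -> np.ndarray:
    """Hypothetical helper: reassemble values from the concatenated streams."""
    width = np.dtype(dtype).itemsize
    # Assumed check: the payload must be exactly num_values * width bytes.
    if len(data) != num_values * width:
        raise ValueError(
            f"data size {len(data)} does not match "
            f"{num_values} values of width {width}"
        )
    streams = np.frombuffer(data, dtype=np.uint8).reshape(width, num_values)
    # Interleave the streams back into whole values.
    return np.ascontiguousarray(streams.T).view(dtype).reshape(-1)


x = np.linspace(0.0, 1.0, 1000).astype(np.float32)
encoded = byte_stream_split_encode(x)
decoded = byte_stream_split_decode(encoded, np.float32, len(x))
print(len(encoded), np.array_equal(decoded, x))
```

In this sketch, any extra bytes appended to the encoded buffer make the decode step raise, which is the kind of mismatch the "Data size too large for number of values" message describes.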
   
   ### Component(s)
   
   Parquet


