jorisvandenbossche commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1826032128

   Based on the potential fix in https://github.com/apache/arrow/pull/38784, I managed to create a simple reproducer.
   
   A file created with pyarrow 13.0 using the following script reads fine with that same version:
   ```python
   import string
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # column with >2GB data: 10,000 random ASCII strings of 10 to 499
   # characters each, repeated 1000 times
   data = [
       "".join(np.random.choice(list(string.ascii_letters), n))
       for n in np.random.randint(10, 500, size=10_000)
   ]
   table = pa.table({'a': pa.array(data*1000)})
   
   # write the single-column table to a Parquet file
   pq.write_table(table, "test_capacity.parquet")
   ```
   
   Reading that file with pyarrow 14, however, raises an error:
   
   ```
   import pyarrow.parquet as pq
   pf = pq.ParquetFile("test_capacity.parquet")
   
   In [6]: pf.read()
   ...
   ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2148282365
   /home/joris/scipy/repos/arrow/cpp/src/arrow/array/builder_binary.h:332  ValidateOverflow(elements)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1202  acc_->builder->ReserveData( std::min<int64_t>(*estimated_data_length, ::arrow::kBinaryMemoryLimit))
   /home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1407  helper.Prepare(len_)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1252  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1233  fut.MoveResult()
   ```
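
For reference, the numbers in the error message can be checked with a bit of arithmetic. The sketch below only restates what the message says: the reported limit equals the largest signed 32-bit integer minus one, and the requested size exceeds it by under 1 MB. The helper names (`exceeds_binary_limit`, `repeated_utf8_bytes`) are made up for illustration and are not part of pyarrow.

```python
# Limit quoted in the ArrowCapacityError message above.
INT32_MAX = 2**31 - 1
BINARY_LIMIT = INT32_MAX - 1  # 2147483646


def exceeds_binary_limit(n_bytes: int) -> bool:
    """Return True if n_bytes is larger than the limit from the error message."""
    return n_bytes > BINARY_LIMIT


def repeated_utf8_bytes(values, repeats: int) -> int:
    """Total UTF-8 byte size of `values` repeated `repeats` times (hypothetical helper)."""
    return sum(len(v.encode("utf-8")) for v in values) * repeats


requested = 2148282365  # the "have" value from the error message
excess = requested - BINARY_LIMIT

print(BINARY_LIMIT)                     # 2147483646
print(exceeds_binary_limit(requested))  # True
print(excess)                           # 798719
```

`repeated_utf8_bytes(data, 1000)` could be applied to the `data` list from the reproducer to estimate the size of the string column without materializing the repeated list first.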
   

