jorisvandenbossche commented on issue #38577: URL: https://github.com/apache/arrow/issues/38577#issuecomment-1826032128
From seeing the potential fix in https://github.com/apache/arrow/pull/38784, I managed to create a simple reproducer. A file created with pyarrow 13.0 as follows reads fine with that same version:

```python
import string

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# column with >2GB data
data = [
    "".join(np.random.choice(list(string.ascii_letters), n))
    for n in np.random.randint(10, 500, size=10_000)
]
table = pa.table({'a': pa.array(data*1000)})

pq.write_table(table, "test_capacity.parquet")
```

but reading it with pyarrow 14 fails:

```
import pyarrow.parquet as pq
pf = pq.ParquetFile("test_capacity.parquet")

In [6]: pf.read()
...
ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2148282365
/home/joris/scipy/repos/arrow/cpp/src/arrow/array/builder_binary.h:332  ValidateOverflow(elements)
/home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1202  acc_->builder->ReserveData( std::min<int64_t>(*estimated_data_length, ::arrow::kBinaryMemoryLimit))
/home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1407  helper.Prepare(len_)
/home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1252  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1233  fut.MoveResult()
```
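
As a rough sanity check on why this column crosses the limit, the sketch below estimates the total string payload of a column built the same way and compares it with the 2147483646-byte figure quoted in the error message. It uses only the standard library rather than numpy/pyarrow, and `BINARY_LIMIT` and `total_utf8_bytes` are names made up for this illustration, not part of any Arrow API.

```python
import random
import string

# Limit taken from the ArrowCapacityError message above
# ("array cannot contain more than 2147483646 bytes").
BINARY_LIMIT = 2_147_483_646


def total_utf8_bytes(strings, repeats=1):
    """Total UTF-8 payload of `strings` repeated `repeats` times."""
    return repeats * sum(len(s.encode("utf-8")) for s in strings)


# Same shape as the reproducer: 10_000 random ASCII strings of length
# 10..499, repeated 1000 times (mirroring `data*1000`).
rng = random.Random(0)
base = [
    "".join(rng.choices(string.ascii_letters, k=rng.randint(10, 499)))
    for _ in range(10_000)
]

total = total_utf8_bytes(base, repeats=1000)
exceeds = total > BINARY_LIMIT
print(f"estimated payload: {total} bytes, exceeds limit: {exceeds}")
```

With an average length of roughly 255 bytes per string, the column holds about 2.5 GB of character data, so it cannot fit into a single array under that limit.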
