vigneshsiva11 opened a new pull request, #9361: URL: https://github.com/apache/arrow-rs/pull/9361
# Which issue does this PR close?

- Refs #7973

# Rationale for this change

This PR adds regression coverage for an offset overflow panic encountered when reading Parquet files with large binary columns using the Arrow Parquet reader.

The issue occurs when a single RecordBatch contains more than 2GB of binary data and is decoded into a `BinaryArray` using 32-bit offsets. Similar failures are observed across multiple Parquet encodings.

Adding regression tests helps document the failure mode and provides a foundation for validating a follow-up fix.

# What changes are included in this PR?

This PR introduces a new test file under `parquet/tests/arrow_reader` that adds regression tests covering large binary columns for the following Parquet encodings:

- PLAIN
- DELTA_LENGTH_BYTE_ARRAY
- DELTA_BYTE_ARRAY
- RLE_DICTIONARY

The tests construct Parquet files that exceed the 32-bit offset limit when decoded into a single Arrow `BinaryArray` and assert successful RecordBatch decoding.

# Are these changes tested?

Yes. This PR consists entirely of regression tests.

The tests are currently marked as ignored to avoid excessive memory usage in CI and to document the existing failure mode. They are intended to be enabled once the underlying reader logic is updated to safely handle large binary data without overflowing offsets.

# Are there any user-facing changes?

No. This PR only adds tests and does not introduce any user-facing changes.
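
For readers unfamiliar with the failure mode described in the rationale, here is a minimal sketch of why 32-bit offsets overflow. It is not the reader's actual code: `offsets_i32` and `offsets_i64` are hypothetical helpers that use only the standard library and work on value lengths alone, so no multi-gigabyte buffers are allocated. It assumes Arrow-style cumulative offsets, where `BinaryArray` stores `i32` offsets and `LargeBinaryArray` stores `i64` offsets.

```rust
/// Hypothetical helper (not part of arrow-rs): builds cumulative offsets
/// with 32-bit integers, as a `BinaryArray` would store them.
/// Returns `None` as soon as the running total no longer fits in `i32`.
fn offsets_i32(value_lengths: &[usize]) -> Option<Vec<i32>> {
    let mut offsets = Vec::with_capacity(value_lengths.len() + 1);
    let mut total: i32 = 0;
    offsets.push(total);
    for &len in value_lengths {
        // A single value longer than i32::MAX bytes cannot be represented.
        let len = i32::try_from(len).ok()?;
        // checked_add reports the overflow instead of panicking or wrapping.
        total = total.checked_add(len)?;
        offsets.push(total);
    }
    Some(offsets)
}

/// Hypothetical helper (not part of arrow-rs): the same computation with
/// 64-bit offsets, as a `LargeBinaryArray` would store them.
fn offsets_i64(value_lengths: &[usize]) -> Option<Vec<i64>> {
    let mut offsets = Vec::with_capacity(value_lengths.len() + 1);
    let mut total: i64 = 0;
    offsets.push(total);
    for &len in value_lengths {
        let len = i64::try_from(len).ok()?;
        total = total.checked_add(len)?;
        offsets.push(total);
    }
    Some(offsets)
}
```

With two values of 1 GiB each, `offsets_i32` returns `None` because the final offset would be 2^31, one past `i32::MAX`, while `offsets_i64` succeeds. This is the situation the regression tests aim to reproduce through the Parquet reader for each encoding.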
