vigneshsiva11 opened a new pull request, #9362: URL: https://github.com/apache/arrow-rs/pull/9362
# Which issue does this PR close?

- Closes #7973.

# Rationale for this change

When reading Parquet files containing very large binary or string values, the Arrow Parquet reader can attempt to construct a RecordBatch whose total value buffer exceeds the maximum representable offset size. This can lead to an overflow error or panic during decoding.

Instead of allowing the buffer to overflow and failing late, the reader should detect this condition early and stop decoding before the offset exceeds the representable limit. This behavior is consistent with other Arrow implementations (for example, PyArrow), which emit smaller batches when encountering very large row groups.

# What changes are included in this PR?

- Add an early overflow check when appending binary values to the Arrow offset buffer.
- Ensure the overflow condition is detected before mutating internal buffers.
- Return a controlled error instead of panicking when the offset limit would be exceeded.
- Apply the fix uniformly across all byte array decoding paths (plain, dictionary, and delta encodings) via the shared offset buffer logic.

# Are these changes tested?

Yes.

- Regression tests covering large binary values were added in a separate PR.
- Existing Parquet reader and writer tests continue to pass in CI.

Note: Some Parquet and Arrow integration tests require external test data provided via git submodules (parquet-testing and testing). These submodules are not present in a minimal local checkout but are initialized in CI.

# Are there any user-facing changes?

Yes.

- Reading Parquet files with very large binary or string columns will no longer panic or fail late due to offset overflow.
- The reader now stops batch construction early and reports the error safely.

There are no breaking changes to public APIs.
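The check-before-mutate pattern listed under "What changes are included in this PR?" can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchOffsetBuffer`, `OffsetOverflow`, `try_push`, and `try_push_with_limit` are hypothetical names, and the sketch assumes 32-bit offsets, so the limit is `i32::MAX` bytes. The configurable limit exists only so the behavior can be exercised without allocating 2 GiB.

```rust
use std::convert::TryFrom;

/// Hypothetical, simplified stand-in for a byte-array offset buffer
/// with 32-bit offsets. Not the arrow-rs implementation.
#[derive(Debug)]
struct SketchOffsetBuffer {
    offsets: Vec<i32>, // offsets[i]..offsets[i + 1] delimits value i
    values: Vec<u8>,   // concatenated value bytes
}

/// Hypothetical error describing an append that would overflow the offsets.
#[derive(Debug, PartialEq)]
struct OffsetOverflow {
    current_len: usize,
    requested: usize,
}

impl SketchOffsetBuffer {
    fn new() -> Self {
        // An offset buffer always starts with a leading 0.
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Append one value, using the largest offset an i32 can represent.
    fn try_push(&mut self, data: &[u8]) -> Result<(), OffsetOverflow> {
        self.try_push_with_limit(data, i32::MAX as usize)
    }

    /// Append one value with an explicit byte limit (useful for small tests).
    fn try_push_with_limit(
        &mut self,
        data: &[u8],
        limit: usize,
    ) -> Result<(), OffsetOverflow> {
        // Compute the would-be end offset first, without touching any buffer.
        let end = self
            .values
            .len()
            .checked_add(data.len())
            .filter(|&n| n <= limit)
            .and_then(|n| i32::try_from(n).ok());

        let end = match end {
            Some(end) => end,
            // Nothing has been mutated yet, so the buffer stays consistent.
            None => {
                return Err(OffsetOverflow {
                    current_len: self.values.len(),
                    requested: data.len(),
                })
            }
        };

        // Only mutate once the new offset is known to be representable.
        self.values.extend_from_slice(data);
        self.offsets.push(end);
        Ok(())
    }
}
```

Because the check runs before either vector is extended, a rejected append leaves the offsets and values in agreement, and the caller can decide whether to surface the error or finish the current batch with the values decoded so far.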
