vigneshsiva11 opened a new pull request, #9361:
URL: https://github.com/apache/arrow-rs/pull/9361

   # Which issue does this PR close?
   
   - Refs #7973
   
   # Rationale for this change
   
   This PR adds regression coverage for an offset overflow panic encountered 
when reading Parquet files with large binary columns using the Arrow Parquet 
reader.
   
   The issue occurs when a single RecordBatch contains more than 2 GiB of binary 
data and is decoded into a 'BinaryArray', whose 32-bit (i32) offsets cannot 
address more than i32::MAX bytes. Similar failures are observed across multiple 
Parquet encodings. Adding regression tests documents the failure mode and 
provides a foundation for validating a follow-up fix.
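
   As a rough illustration of the failure mode described above, the sketch 
below builds Arrow-style 32-bit offsets from a list of value lengths and 
reports when the running total no longer fits in an i32. It uses only the 
standard library; the function name 'build_i32_offsets' is hypothetical and is 
not part of the arrow or parquet crates.

```rust
use std::convert::TryFrom;

/// Builds cumulative 32-bit offsets (one more entry than there are values),
/// returning `None` as soon as the running total would exceed `i32::MAX`.
/// Hypothetical helper for illustration only.
fn build_i32_offsets(value_lens: &[usize]) -> Option<Vec<i32>> {
    let mut offsets = Vec::with_capacity(value_lens.len() + 1);
    let mut total: i32 = 0;
    offsets.push(total);
    for &len in value_lens {
        // A single value longer than i32::MAX bytes cannot be represented.
        let len = i32::try_from(len).ok()?;
        // checked_add reports overflow instead of panicking or wrapping.
        total = total.checked_add(len)?;
        offsets.push(total);
    }
    Some(offsets)
}
```

   Two 1 GiB values are already enough to overflow, because their combined 
length is 2^31 bytes, one more than i32::MAX.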
   
   # What changes are included in this PR?
   
   This PR introduces a new test file under 'parquet/tests/arrow_reader' that 
adds regression tests covering large binary columns for the following Parquet 
encodings:
   
   - PLAIN  
   - DELTA_LENGTH_BYTE_ARRAY  
   - DELTA_BYTE_ARRAY  
   - RLE_DICTIONARY  
   
   The tests construct Parquet files that exceed the 32-bit offset limit when 
decoded into a single Arrow 'BinaryArray' and assert successful RecordBatch 
decoding.
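
   To give a sense of how much data such a test has to produce, the sketch 
below computes the smallest number of fixed-size values whose combined length 
exceeds i32::MAX. The value sizes used by the actual tests are not stated 
here, so the 1 MiB figure in the usage note is only an assumption, and 
'rows_to_exceed_i32_offsets' is a hypothetical name.

```rust
/// Smallest number of values of `value_len` bytes whose total length is
/// strictly greater than `i32::MAX`. Returns `None` for empty values, which
/// can never exceed the limit. Hypothetical helper for illustration only.
fn rows_to_exceed_i32_offsets(value_len: u64) -> Option<u64> {
    if value_len == 0 {
        return None;
    }
    // floor(i32::MAX / value_len) rows still fit; one more row does not.
    Some(i32::MAX as u64 / value_len + 1)
}
```

   Under the assumed 1 MiB value size this yields 2048 rows, so a single 
batch has to hold at least that many values to cross the limit.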
   
   # Are these changes tested?
   
   Yes. This PR consists entirely of regression tests.
   
   The tests are currently marked as ignored to avoid excessive memory usage in 
CI and to document the existing failure mode. They are intended to be enabled 
once the underlying reader logic is updated to safely handle large binary data 
without overflowing offsets.
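
   For readers unfamiliar with ignored tests in Rust, the sketch below shows 
the general shape: the '#[ignore]' attribute keeps a test out of the default 
run, and 'cargo test -- --ignored' runs it on demand. The test name, the reason 
string, and the 'total_bytes' helper are hypothetical and do not reproduce the 
tests in this PR.

```rust
/// Total payload size for `rows` values of `value_len` bytes each.
/// Hypothetical stand-in for the real work of writing and reading a file.
fn total_bytes(rows: u64, value_len: u64) -> u64 {
    rows * value_len
}

#[test]
#[ignore = "needs more than 2 GiB of memory; run with `cargo test -- --ignored`"]
fn large_binary_exceeds_i32_offsets() {
    // A real test would write a Parquet file and decode it here; this sketch
    // only checks that the assumed payload crosses the 32-bit offset limit.
    assert!(total_bytes(2048, 1 << 20) > i32::MAX as u64);
}
```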
   
   # Are there any user-facing changes?
   
   No. This PR only adds tests and does not introduce any user-facing changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.