aryansri05 opened a new pull request, #49513:
URL: https://github.com/apache/arrow/pull/49513

   Refers to #49502.
   
   ### Rationale for this change
   
   When writing large dictionary-encoded Parquet data with
   `ARROW_LARGE_MEMORY_TESTS=ON` enabled, two tests were failing:
   
   - `TestColumnWriter.WriteLargeDictEncodedPage` — expected 2 pages, got 7501
   - `TestColumnWriter.ThrowsOnDictIndicesTooLarge` — expected a
     `ParquetException`, but nothing was thrown
   
   The root cause is that `PutIndicesTyped()` in `DictEncoderImpl` did not check
   whether the total number of buffered dictionary indices exceeds `INT32_MAX`.
   The existing overflow check in `FlushValues()` only validates the buffer size
   in bytes, not the index count, so it never triggered in this case.
   
   ### What changes are included in this PR?
   
   Added an overflow check in `DictEncoderImpl::PutIndicesTyped()`, immediately
   after `buffered_indices_.resize()`:
   
   ```cpp
   if (buffered_indices_.size() >
       static_cast<size_t>(std::numeric_limits<int32_t>::max())) {
     throw ParquetException("Total dictionary indices count (",
                            buffered_indices_.size(),
                            ") exceeds maximum int value");
   }
   ```
   
   This makes the encoder throw a `ParquetException` with a message containing 
   "exceeds maximum int value" when the index count overflows, which is exactly 
   what `ThrowsOnDictIndicesTooLarge` expects.
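   
   For illustration, the guard can be sketched as a standalone snippet outside
   the Arrow tree. `CheckIndexCountFits` and its `limit` parameter are
   hypothetical names, and `std::overflow_error` stands in for
   `ParquetException` so the snippet compiles without Parquet headers; the
   small artificial limit in `main` exercises the overflow path without
   allocating `INT32_MAX` indices:
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <limits>
   #include <sstream>
   #include <stdexcept>
   
   // Hypothetical stand-in for the guard added in PutIndicesTyped(): after the
   // buffered index vector grows, reject totals that no longer fit in int32_t,
   // since downstream code takes the index count as a 32-bit int.
   void CheckIndexCountFits(size_t total_indices,
                            size_t limit = static_cast<size_t>(
                                std::numeric_limits<int32_t>::max())) {
     if (total_indices > limit) {
       std::ostringstream msg;
       msg << "Total dictionary indices count (" << total_indices
           << ") exceeds maximum int value";
       throw std::overflow_error(msg.str());
     }
   }
   
   int main() {
     // A count within range passes silently.
     CheckIndexCountFits(1000);
   
     // A tiny artificial limit triggers the same overflow path cheaply.
     bool threw = false;
     try {
       CheckIndexCountFits(11, 10);
     } catch (const std::overflow_error&) {
       threw = true;
     }
     assert(threw);
     return 0;
   }
   ```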
   
   ### Are these changes tested?
   
   Yes — the existing tests in `column_writer_test.cc` cover this fix:
   - `TestColumnWriter.ThrowsOnDictIndicesTooLarge`
   - `TestColumnWriter.WriteLargeDictEncodedPage`
   
   Both tests were failing before this fix and should pass after.
   Tests require building with `ARROW_LARGE_MEMORY_TESTS=ON`.
   
   
   This PR contains a "Critical Fix": previously, writing dictionary-encoded
   data with more than `INT32_MAX` indices would silently produce incorrect
   output (a wrong page count) instead of raising an error. With this fix, the
   encoder correctly throws a `ParquetException` in that scenario.

