yadavay-amzn opened a new pull request, #16347:
URL: https://github.com/apache/iceberg/pull/16347

   Fixes #16325.
   
   ## Problem
   
   When GZIP or ZSTD compression is enabled, the row group size check in 
`ParquetWriter` uses `writeStore.getBufferedSize()`, which reports compressed 
bytes once pages have been flushed. Because the compressed size is much smaller 
than the configured `targetRowGroupSize`, the threshold is never reached and 
row groups grow without bound.
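
To illustrate the gap described above, here is a minimal standalone sketch that uses only `java.util.zip`, not Parquet or Iceberg. The class `CompressedSizeSketch` and its `sizes` method are hypothetical names, and the record content is deliberately repetitive, so the ratio is more extreme than real data would show.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; this is not Iceberg or Parquet code.
final class CompressedSizeSketch {
  private CompressedSizeSketch() {}

  /** Returns {uncompressedBytes, compressedBytes} after writing `records` 1 KB records through GZIP. */
  static long[] sizes(int records) {
    byte[] record = new byte[1024];
    Arrays.fill(record, (byte) 'x'); // highly compressible payload (an assumption)

    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    long uncompressed = 0;
    try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
      for (int i = 0; i < records; i++) {
        gzip.write(record);
        uncompressed += record.length;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // sink.size() is read after the GZIP stream is closed, so the trailer is included.
    return new long[] {uncompressed, sink.size()};
  }
}
```

Calling `CompressedSizeSketch.sizes(500)` gives 512,000 uncompressed bytes, while the compressed size stays well below a 64 KB threshold. A check based on the compressed figure would therefore never fire for this input.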
   
   ## Fix
   
   Track uncompressed bytes by measuring the `getBufferedSize()` delta before 
and after each `model.write()` call (that is, before `endRecord()` triggers a 
page flush and compression). Use this accumulated uncompressed size in 
`checkSize()` instead of the post-compression buffered size. The counter is 
reset when a row group is flushed.
   
   ## Testing
   
   Added `testRowGroupSizeEnforcedWithCompression` in `TestParquet`, which 
writes 500 records of ~1KB each with GZIP compression and a 64KB row group 
target, then asserts that multiple row groups are created.
   
   - **Without fix**: all 500 records end up in 1 row group (compressed size 
never hits threshold)
   - **With fix**: multiple row groups created respecting the 64KB target
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

