nssalian opened a new pull request, #16327:
URL: https://github.com/apache/iceberg/pull/16327

   Closes: #16325 
   
   ## Rationale for this Change
   
   Adds `write.parquet.row-group-size-check-uncompressed` (default false) to 
accurately enforce `write.parquet.row-group-size-bytes` when using compressing 
codecs (GZIP, ZSTD, etc.).
   
   `ParquetWriter.checkSize()` uses `writeStore.getBufferedSize()` which 
reports compressed bytes for flushed pages. With effective compression, the 
writer never sees the target exceeded because it's comparing compressed data 
against an uncompressed limit. Row groups grow unbounded.
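
To make the mismatch concrete, here is a toy model of the check, not Iceberg or parquet-java code. It assumes that an open page counts at its raw size and a flushed page counts at its compressed size, as described above. The class name, the page size, and the integer compression ratio are all hypothetical, and the real writer's check interval is ignored.

```java
/** Toy model of a size check that sees compressed bytes for flushed pages. */
final class CompressedCheckSketch {

  /**
   * Returns how many uncompressed bytes have been written by the time
   * the buffered-size check first reaches the target.
   *
   * All parameters are illustrative assumptions, not Iceberg settings.
   */
  static long uncompressedBytesAtTrigger(
      long targetBytes, int recordBytes, int pageBytes, int compressionRatio) {
    long flushedCompressed = 0; // flushed pages, counted at compressed size
    long openPage = 0;          // current page, still uncompressed
    long totalUncompressed = 0; // what the row group really holds

    while (true) {
      openPage += recordBytes;
      totalUncompressed += recordBytes;

      // Once a page fills up it is compressed and flushed.
      if (openPage >= pageBytes) {
        flushedCompressed += openPage / compressionRatio;
        openPage = 0;
      }

      // The check compares this mixed figure against an uncompressed target.
      long bufferedSize = flushedCompressed + openPage;
      if (bufferedSize >= targetBytes) {
        return totalUncompressed;
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(uncompressedBytesAtTrigger(1000, 10, 100, 1));
    System.out.println(uncompressedBytesAtTrigger(1000, 10, 100, 10));
  }
}
```

With a ratio of 1 the check fires at the target. With a higher ratio it fires only after several times the target has been written, which is the overshoot this PR addresses.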
   
   ## What changes are included in this PR?
   
   When `write.parquet.row-group-size-check-uncompressed=true`:
   
   1. Measures `getBufferedSize()` before and after `model.write()` per record. 
Between these points, data is in uncompressed column buffers (no page flush 
occurs during `model.write()`). The delta is the exact uncompressed record size.
   2. Accumulates into `rowGroupUncompressedSize`. Flushes when it hits the 
target.
   3. Removes the 100-record minimum check interval floor for the uncompressed 
path.
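
The three steps above can be sketched roughly as follows. This is a simplified stand-in rather than the actual `ParquetWriter` change: `FakeWriteStore`, `RowGroupSizeSketch`, and `flushedRecordCounts` are hypothetical names, and the fake store simply grows by the raw record size on each write. Only `getBufferedSize()` and `rowGroupUncompressedSize` mirror names from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of per-record uncompressed size tracking. */
final class RowGroupSizeSketch {

  /** Hypothetical stand-in for the column write store. */
  static final class FakeWriteStore {
    private long buffered = 0;

    long getBufferedSize() {
      return buffered;
    }

    void write(int recordBytes) {
      buffered += recordBytes; // no page flush happens during a write
    }

    void reset() {
      buffered = 0;
    }
  }

  private final FakeWriteStore store = new FakeWriteStore();
  private final long targetRowGroupSize;
  private long rowGroupUncompressedSize = 0;
  private long recordsInGroup = 0;

  /** Record count of each row group flushed so far. */
  final List<Long> flushedRecordCounts = new ArrayList<>();

  RowGroupSizeSketch(long targetRowGroupSize) {
    this.targetRowGroupSize = targetRowGroupSize;
  }

  void add(int recordBytes) {
    // Step 1: measure the buffered size around the write; the delta is
    // the uncompressed size of this record.
    long before = store.getBufferedSize();
    store.write(recordBytes);
    long after = store.getBufferedSize();

    // Step 2: accumulate and flush once the target is reached.
    rowGroupUncompressedSize += after - before;
    recordsInGroup++;

    // Step 3: check on every record, with no minimum interval.
    if (rowGroupUncompressedSize >= targetRowGroupSize) {
      flushRowGroup();
    }
  }

  long pendingUncompressedSize() {
    return rowGroupUncompressedSize;
  }

  private void flushRowGroup() {
    flushedRecordCounts.add(recordsInGroup);
    store.reset();
    rowGroupUncompressedSize = 0;
    recordsInGroup = 0;
  }

  public static void main(String[] args) {
    RowGroupSizeSketch writer = new RowGroupSizeSketch(1000);
    for (int i = 0; i < 25; i++) {
      writer.add(100);
    }
    System.out.println(writer.flushedRecordCounts);
    System.out.println(writer.pendingUncompressedSize());
  }
}
```

Because the accumulator is fed only by deltas taken around each write, it does not depend on how earlier pages were compressed.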
   
   Disabled by default.
   
   When enabled, `getBufferedSize()` is called for every record. Each call 
iterates over the column writers and sums a few field reads per writer. It's 
the same pattern parquet-mr uses in 
[`ColumnWriteStoreBase.sizeCheck()`](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L208-L250).
 
   
   ## Are these changes tested?
   
   - [x] Parameterized test across all codecs (gzip, snappy, zstd, uncompressed)
   - [x] Existing parquet tests pass locally
   
   ## Are there any user-facing changes?
   Yes. Adds a new configuration property, 
`write.parquet.row-group-size-check-uncompressed`, which defaults to `false`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

