nssalian opened a new pull request, #16327: URL: https://github.com/apache/iceberg/pull/16327
Closes: #16325

## Rationale for this Change

Adds `write.parquet.row-group-size-check-uncompressed` (default `false`) to accurately enforce `write.parquet.row-group-size-bytes` when using compressing codecs (GZIP, ZSTD, etc.).

`ParquetWriter.checkSize()` uses `writeStore.getBufferedSize()`, which reports compressed bytes for pages that have already been flushed. With effective compression, the writer never sees the target exceeded, because it is comparing compressed data against an uncompressed limit. As a result, row groups grow unbounded.

## What changes are included in this PR?

When `write.parquet.row-group-size-check-uncompressed=true`, the writer:

1. Measures `getBufferedSize()` before and after `model.write()` for each record. Between these two points, the data sits in uncompressed column buffers (no page flush occurs during `model.write()`), so the delta is the exact uncompressed record size.
2. Accumulates the delta into `rowGroupUncompressedSize` and flushes the row group when it reaches the target.
3. Removes the 100-record minimum check interval floor for the uncompressed path.

The option is disabled by default. When enabled, `getBufferedSize()` is called for every record. Each call iterates over the column writers and adds up field reads. It's the same pattern parquet-mr uses in [`ColumnWriteStoreBase.sizeCheck()`](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L208-L250).

## Are these changes tested?

- [x] Parameterized test across all codecs (gzip, snappy, zstd, uncompressed)
- [x] Existing parquet tests pass locally

## Are there any user-facing changes?

Yes. This adds a new configuration property, set to `false` by default.
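The before/after measurement described above can be illustrated with a small standalone simulation. This is only a sketch and not the code in this PR: `FakeWriteStore`, `rowGroupSizes`, the fixed integer compression ratio, and the page-flush rule are all hypothetical stand-ins chosen to mimic a store whose `getBufferedSize()` mixes compressed flushed pages with uncompressed open buffers.

```java
import java.util.ArrayList;
import java.util.List;

public class UncompressedSizeCheckSketch {

  /** Hypothetical stand-in for the Parquet column write store (not the real API). */
  static final class FakeWriteStore {
    private final long pageSizeBytes;
    private final long compressionRatio;     // assumed fixed ratio, e.g. 10 means 10:1
    private long openPageBytes = 0;          // uncompressed bytes not yet flushed to a page
    private long flushedCompressedBytes = 0; // compressed bytes of already flushed pages

    FakeWriteStore(long pageSizeBytes, long compressionRatio) {
      this.pageSizeBytes = pageSizeBytes;
      this.compressionRatio = compressionRatio;
    }

    /** Buffers one record; no page flush happens inside this call. */
    void write(long recordBytes) {
      openPageBytes += recordBytes;
    }

    /** Runs between records: compresses the open page once it is full. */
    void maybeFlushPage() {
      if (openPageBytes >= pageSizeBytes) {
        flushedCompressedBytes += openPageBytes / compressionRatio;
        openPageBytes = 0;
      }
    }

    /** Mixes compressed (flushed) and uncompressed (open) bytes. */
    long getBufferedSize() {
      return flushedCompressedBytes + openPageBytes;
    }

    void startNewRowGroup() {
      openPageBytes = 0;
      flushedCompressedBytes = 0;
    }
  }

  /** Returns the true uncompressed size of each row group the simulated writer produces. */
  static List<Long> rowGroupSizes(
      long[] recordSizes, long targetBytes, long pageSizeBytes,
      long compressionRatio, boolean checkUncompressed) {
    FakeWriteStore store = new FakeWriteStore(pageSizeBytes, compressionRatio);
    List<Long> sizes = new ArrayList<>();
    long rowGroupUncompressedSize = 0; // accumulated per-record deltas
    long actualUncompressed = 0;       // ground truth, for reporting only

    for (long recordBytes : recordSizes) {
      long before = store.getBufferedSize();
      store.write(recordBytes);
      long after = store.getBufferedSize();
      rowGroupUncompressedSize += after - before; // exact uncompressed record size
      actualUncompressed += recordBytes;
      store.maybeFlushPage();

      long measured = checkUncompressed ? rowGroupUncompressedSize : store.getBufferedSize();
      if (measured >= targetBytes) {
        sizes.add(actualUncompressed);
        store.startNewRowGroup();
        rowGroupUncompressedSize = 0;
        actualUncompressed = 0;
      }
    }
    if (actualUncompressed > 0) {
      sizes.add(actualUncompressed); // trailing, partially filled row group
    }
    return sizes;
  }

  public static void main(String[] args) {
    long[] records = new long[1000];
    java.util.Arrays.fill(records, 100L); // 1000 records of 100 bytes each
    System.out.println("delta check:    " + rowGroupSizes(records, 10_000, 1_000, 10, true));
    System.out.println("buffered check: " + rowGroupSizes(records, 10_000, 1_000, 10, false));
  }
}
```

In this toy model the delta-based check cuts a row group at the uncompressed target, while the buffered-size check overshoots it by a wide margin because most of what it measures has already been compressed.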
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
