steveloughran commented on code in PR #16327:
URL: https://github.com/apache/iceberg/pull/16327#discussion_r3249376859
##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java:
##########
@@ -190,6 +204,30 @@ public List<Long> splitOffsets() {
}
private void checkSize() {
+ if (trackUncompressedSize) {
+ checkSizeUncompressed();
+ } else {
+ checkSizeDefault();
+ }
+ }
+
+ private void checkSizeUncompressed() {
+ if (rowGroupUncompressedSize >= targetRowGroupSize) {
+ flushRowGroup(false);
+ } else if (recordCount >= nextCheckRecordCount) {
+ double avgRecordSize = ((double) rowGroupUncompressedSize) / recordCount;
+ if (rowGroupUncompressedSize > (targetRowGroupSize - 2 * avgRecordSize)) {
+ flushRowGroup(false);
+ } else {
+ long remainingSpace = targetRowGroupSize - rowGroupUncompressedSize;
+ long remainingRecords = (long) (remainingSpace / avgRecordSize);
+ this.nextCheckRecordCount =
+ recordCount + Math.min(remainingRecords / 2, props.getMaxRowCountForPageSizeCheck());
+ }
+ }
+ }
+
+ private void checkSizeDefault() {
Review Comment:
I'd give this a clearer name, one that makes clear it checks the size on the
filesystem; "default" only says it's the default option, not what the method does.
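For example, something along these lines. This is a sketch only: `checkSizeOnDisk` and the class name are hypothetical, not names from the PR, and the method bodies are stubs that just record which path ran.

```java
// Sketch only: the bodies are stubs that record which check ran.
// checkSizeOnDisk is a hypothetical name, not taken from the PR.
final class SizeCheckNamingSketch {
  private final boolean trackUncompressedSize;
  private String lastCheck = "none";

  SizeCheckNamingSketch(boolean trackUncompressedSize) {
    this.trackUncompressedSize = trackUncompressedSize;
  }

  void checkSize() {
    if (trackUncompressedSize) {
      checkSizeUncompressed();
    } else {
      // The name states what is measured rather than that it is the default.
      checkSizeOnDisk();
    }
  }

  private void checkSizeUncompressed() {
    lastCheck = "uncompressed";
  }

  private void checkSizeOnDisk() {
    lastCheck = "on-disk";
  }

  String lastCheck() {
    return lastCheck;
  }
}
```

Any name that says what is being measured would do; the point is only that the call site reads without needing to know which option is the default.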
##########
parquet/src/test/java/org/apache/iceberg/parquet/TestParquetDataWriter.java:
##########
@@ -541,4 +544,50 @@ protected int resolveColumnIndex(Void engineSchema, String columnName) {
variantSchema.asStruct(), variantRecords.get(i), writtenRecords.get(i));
}
}
+
+ @ParameterizedTest
+ @ValueSource(strings = {"gzip", "snappy", "zstd", "uncompressed"})
+ public void testRowGroupSizeEnforcedWhenCompressionEnabled(String codec) throws IOException {
Review Comment:
Is there an equivalent test which verifies that, with the default setting,
it's the compressed byte count that's used? That's critical for regression
testing.
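To illustrate the property such a test would pin down, here is a toy model. It is not the Iceberg `ParquetWriter`: `RowGroupFlushModel`, its fields, and the fixed compression ratio are all assumptions made for the sketch, and it checks the size on every record rather than at `nextCheckRecordCount` intervals. A real test would write a file and inspect the resulting row groups.

```java
// Toy model, not Iceberg code: flushes a "row group" when the tracked
// byte count reaches the target. All names here are hypothetical.
final class RowGroupFlushModel {
  private final long targetRowGroupSize;
  private final boolean trackUncompressedSize;
  private long compressedBytes;
  private long uncompressedBytes;
  private int flushCount;

  RowGroupFlushModel(long targetRowGroupSize, boolean trackUncompressedSize) {
    this.targetRowGroupSize = targetRowGroupSize;
    this.trackUncompressedSize = trackUncompressedSize;
  }

  // Adds one record; compressionRatio is compressed size / uncompressed size.
  void add(long recordBytes, double compressionRatio) {
    uncompressedBytes += recordBytes;
    compressedBytes += Math.round(recordBytes * compressionRatio);
    // Default setting (flag off) compares the compressed count to the target.
    long tracked = trackUncompressedSize ? uncompressedBytes : compressedBytes;
    if (tracked >= targetRowGroupSize) {
      flushCount++;
      compressedBytes = 0;
      uncompressedBytes = 0;
    }
  }

  int flushCount() {
    return flushCount;
  }

  public static void main(String[] args) {
    RowGroupFlushModel byDefault = new RowGroupFlushModel(1000, false);
    RowGroupFlushModel uncompressed = new RowGroupFlushModel(1000, true);
    for (int i = 0; i < 40; i++) {
      byDefault.add(100, 0.25);
      uncompressed.add(100, 0.25);
    }
    System.out.println("default flushes: " + byDefault.flushCount());
    System.out.println("uncompressed flushes: " + uncompressed.flushCount());
  }
}
```

With 40 records of 100 bytes at a 4:1 ratio and a 1000-byte target, the default path flushes once and the uncompressed path flushes four times. That gap is what a regression test on the default setting would need to assert.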
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]