wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3360665791
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -57,16 +70,16 @@ public static <I extends ValuesWriter & RequiresFallback, F
extends ValuesWriter
*/
private long rawDataByteSize = 0;
- /**
- * indicates if this is the first page being processed
- */
- private boolean firstPage = true;
-
public FallbackValuesWriter(I initialWriter, F fallBackWriter) {
+ this(initialWriter, fallBackWriter, 0);
Review Comment:
```suggestion
this(initialWriter, fallBackWriter, /*checkAfterBytes=*/0);
```
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -44,6 +49,14 @@ public static <I extends ValuesWriter & RequiresFallback, F
extends ValuesWriter
private boolean fellBackAlready = false;
+ private boolean compressionChecked = false;
+
+ private final long checkAfterBytes;
+
+ /* Accumulates raw bytes across pages (only reset in resetDictionary) so the
+ * threshold check works even when individual pages are smaller than
checkAfterBytes. */
+ private long cumulativeRawBytes = 0;
Review Comment:
```suggestion
private boolean compressionChecked = false;
private final long checkAfterBytes;
/* Accumulates raw bytes across pages (only reset in resetDictionary) so
the
* threshold check works even when individual pages are smaller than
checkAfterBytes. */
private long cumulativeRawBytes = 0;
```
How about removing these blank lines?
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -30,7 +30,12 @@ public class FallbackValuesWriter<I extends ValuesWriter &
RequiresFallback, F e
public static <I extends ValuesWriter & RequiresFallback, F extends
ValuesWriter> FallbackValuesWriter<I, F> of(
I initialWriter, F fallBackWriter) {
- return new FallbackValuesWriter<>(initialWriter, fallBackWriter);
+ return new FallbackValuesWriter<>(initialWriter, fallBackWriter, 0);
Review Comment:
```suggestion
return new FallbackValuesWriter<>(initialWriter, fallBackWriter,
/*checkAfterBytes=*/0);
```
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -167,6 +180,7 @@ private void fallBack() {
@Override
public void writeByte(int value) {
rawDataByteSize += 1;
+ cumulativeRawBytes += 1;
Review Comment:
Should we use `cumulativeRawBytes + rawDataByteSize` for checking and only
increase `cumulativeRawBytes` once a page to save some cycles?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##########
@@ -161,6 +161,7 @@ public static enum JobSummaryLevel {
public static final String BLOCK_ROW_COUNT_LIMIT =
"parquet.block.row.count.limit";
public static final String PAGE_ROW_COUNT_LIMIT =
"parquet.page.row.count.limit";
public static final String PAGE_WRITE_CHECKSUM_ENABLED =
"parquet.page.write-checksum.enabled";
+ public static final String DICTIONARY_CHECK_AFTER_BYTES =
"parquet.dictionary.check.after.raw.bytes";
Review Comment:
TBH, the flag name still looks a little bit unclear to me. How about
renaming it to `parquet.dictionary.check.threshold.raw.size.bytes` and also
change associated variable and function names?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]