wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3360641976


##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -79,8 +92,8 @@ public long getBufferedSize() {
 
   @Override
   public BytesInput getBytes() {
-    if (!fellBackAlready && firstPage) {
-      // we use the first page to decide if we're going to use this encoding
+    if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >= 
checkAfterBytes) {

Review Comment:
   Could we add a test where the threshold is crossed only after a `reset()`? 
This is the exact case we want to protect here, since `rawDataByteSize` is 
reset per page while `cumulativeRawBytes` should keep accumulating across 
pages. The current tests with `0` and `Long.MAX_VALUE` would not catch a 
regression back to per-page counting.



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java:
##########
@@ -771,6 +771,17 @@ public SELF withPageWriteChecksumEnabled(boolean 
enablePageWriteChecksum) {
       return self();
     }
 
+    /**
+     * Set the raw data byte threshold after which the dictionary compression 
check is performed.
+     *
+     * @param val byte threshold (0 means check on every page)

Review Comment:
   Nit: this says `0 means check on every page`, but the implementation only 
checks once per column chunk, same as the old first-page behavior. Could we 
update this to say `0` means check on the first page / preserve the previous 
behavior?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to