wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3394400538


##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -79,8 +91,13 @@ public long getBufferedSize() {
 
   @Override
   public BytesInput getBytes() {
-    if (!fellBackAlready && firstPage) {
-      // we use the first page to decide if we're going to use this encoding
+    try {
+      cumulativeRawBytes = Math.addExact(cumulativeRawBytes, rawDataByteSize);
+    } catch (ArithmeticException e) {
+      // overflow, keep the previous value
+    }
+    if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >= 
dictionaryCheckThresholdRawSizeBytes) {
+      compressionChecked = true;
       BytesInput bytes = initialWriter.getBytes();
       if (!initialWriter.isCompressionSatisfying(rawDataByteSize, 
bytes.size())) {

Review Comment:
   I think the threshold gate now accumulates raw bytes across pages, but the 
compression check still compares the current page raw size against the current 
page encoded size plus the whole accumulated dictionary size. This seems biased 
toward fallback once the check is delayed beyond the first page. Should this 
comparison use cumulative raw/encoded sizes, or otherwise avoid charging the 
full column-chunk dictionary size to a single page? A finite-threshold test 
with repeated/moderate-cardinality values across pages would help verify the 
intended PARQUET-3479 case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to