yadavay-amzn commented on PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4697120061

   Good catch, thanks @wgtmac. You'''re right — the trigger became cumulative 
but the comparison was still per-page against the cumulative dictionary, which 
biases toward fallback the later the check fires.
   
   I'''ve pushed a fix that tracks `cumulativeEncodedBytes` alongside 
`cumulativeRawBytes` and changes the decision to 
`isCompressionSatisfying(cumulativeRawBytes, cumulativeEncodedBytes)`, so the 
column-chunk dictionary is amortized over all the pages it covers rather than 
charged against a single page. The encoded size is accumulated from the 
`BytesInput` already produced each page (no extra `getBytes()` calls), and both 
accumulators now use `Math.addExact` consistently and reset together in 
`resetDictionary()`.
   
   I added a test (`TestFallbackCumulativeBias`) that reproduces exactly the 
case you described — repeated/moderate-cardinality values across multiple pages 
with a finite threshold that fires on a later page. The arithmetic witness:
   - **Before (per-page):** `encoded(93) + dict(400) = 493 >= pageRaw(400)` -> 
fallback
   - **After (cumulative):** `totalEncoded(173) + dict(400) = 573 < 
totalRaw(800)` -> dictionary kept
   
   It fails on the previous commit and passes now. I also added an adversarial 
test confirming a genuinely high-cardinality column still correctly falls back 
(the fix amortizes the dictionary cost, it doesn'''t disable fallback), plus a 
`resetDictionary` isolation test so one chunk'''s history doesn'''t leak into 
the next.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to