yadavay-amzn commented on PR #3556: URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4697120061
Good catch, thanks @wgtmac. You'''re right — the trigger became cumulative but the comparison was still per-page against the cumulative dictionary, which biases toward fallback the later the check fires. I'''ve pushed a fix that tracks `cumulativeEncodedBytes` alongside `cumulativeRawBytes` and changes the decision to `isCompressionSatisfying(cumulativeRawBytes, cumulativeEncodedBytes)`, so the column-chunk dictionary is amortized over all the pages it covers rather than charged against a single page. The encoded size is accumulated from the `BytesInput` already produced each page (no extra `getBytes()` calls), and both accumulators now use `Math.addExact` consistently and reset together in `resetDictionary()`. I added a test (`TestFallbackCumulativeBias`) that reproduces exactly the case you described — repeated/moderate-cardinality values across multiple pages with a finite threshold that fires on a later page. The arithmetic witness: - **Before (per-page):** `encoded(93) + dict(400) = 493 >= pageRaw(400)` -> fallback - **After (cumulative):** `totalEncoded(173) + dict(400) = 573 < totalRaw(800)` -> dictionary kept It fails on the previous commit and passes now. I also added an adversarial test confirming a genuinely high-cardinality column still correctly falls back (the fix amortizes the dictionary cost, it doesn'''t disable fallback), plus a `resetDictionary` isolation test so one chunk'''s history doesn'''t leak into the next. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
