yadavay-amzn commented on PR #3556: URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4626286799
@wgtmac Good catches, thanks. Updated:

1. **Renamed** the config to `parquet.dictionary.check.after.raw.bytes` to clarify that it refers to uncompressed bytes.
2. **Fixed the reset issue.** You were right: `rawDataByteSize` resets per page via `reset()`, so it would never reach a 1MB threshold. Added a separate `cumulativeRawBytes` counter that accumulates across pages and only resets in `resetDictionary()` (between column chunks). The threshold gate uses `cumulativeRawBytes`; the actual compression comparison still uses the current page's `rawDataByteSize` vs. encoded size, the same apples-to-apples comparison as before.
3. **Default is now 0** (backward compatible: the check fires on the first page, same as the old `firstPage` behavior). Users can opt in to a higher threshold to delay the check.
4. **Null-heavy columns:** with the default of 0, the check fires on the first page regardless of null ratio, same as before. With a higher threshold, nulls don't contribute to `cumulativeRawBytes` (they're in definition levels), so the threshold takes longer to reach, but the check still eventually fires once enough non-null values accumulate. For all-null columns the check fires immediately, since `cumulativeRawBytes >= 0` is trivially true.

Let me know if this direction works or if you'd prefer a different approach.
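For reference, the two-counter gating described in points 2 to 4 can be sketched roughly as below. This is a minimal illustration, not the code in the PR: the class `DictionaryCheckGateSketch` and the methods `addNonNullValue`, `thresholdReached`, and `currentPageRawBytes` are hypothetical names; only `rawDataByteSize`, `cumulativeRawBytes`, `reset()`, and `resetDictionary()` come from the comment above. Whether `resetDictionary()` also touches the per-page counter is not stated, so the sketch leaves it alone.

```java
// Hypothetical sketch of the threshold gate; not the actual parquet-java implementation.
final class DictionaryCheckGateSketch {
    // Assumed to hold the value of parquet.dictionary.check.after.raw.bytes (0 = check on first page).
    private final long checkAfterRawBytes;
    // Raw (uncompressed) bytes written to the current page only.
    private long rawDataByteSize;
    // Raw bytes written to the current column chunk, across all of its pages.
    private long cumulativeRawBytes;

    DictionaryCheckGateSketch(long checkAfterRawBytes) {
        this.checkAfterRawBytes = checkAfterRawBytes;
    }

    // Nulls never reach this method: they are carried by definition levels.
    void addNonNullValue(int rawBytes) {
        rawDataByteSize += rawBytes;
        cumulativeRawBytes += rawBytes;
    }

    // Called at each page boundary: only the per-page counter is cleared.
    void reset() {
        rawDataByteSize = 0;
    }

    // Called between column chunks: the cumulative counter starts over.
    void resetDictionary() {
        cumulativeRawBytes = 0;
    }

    // The gate: has this column chunk accumulated enough raw bytes to run the check?
    boolean thresholdReached() {
        return cumulativeRawBytes >= checkAfterRawBytes;
    }

    // The compression comparison itself would still use this per-page value.
    long currentPageRawBytes() {
        return rawDataByteSize;
    }

    public static void main(String[] args) {
        DictionaryCheckGateSketch gate = new DictionaryCheckGateSketch(1000);
        gate.addNonNullValue(600);
        System.out.println(gate.thresholdReached()); // false: 600 < 1000
        gate.reset();                                // page boundary
        gate.addNonNullValue(600);
        System.out.println(gate.thresholdReached()); // true: 1200 >= 1000
    }
}
```

With a threshold of 0, `thresholdReached()` is true before any value is written, which matches the all-null case in point 4.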
