yadavay-amzn commented on PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4626286799

   @wgtmac Good catches, thanks. Updated:
   
   1. **Renamed** config to `parquet.dictionary.check.after.raw.bytes` to 
clarify it refers to uncompressed bytes.
   
   2. **Fixed the reset issue.** You were right — `rawDataByteSize` resets per 
page via `reset()`, so it would never reach a 1MB threshold. Added a separate 
`cumulativeRawBytes` counter that accumulates across pages and only resets in 
`resetDictionary()` (between column chunks). The threshold gate uses 
`cumulativeRawBytes`; the actual compression comparison still uses the current 
page's `rawDataByteSize` vs encoded size — same apples-to-apples comparison as 
before.
   
   3. **Default is now 0** (backward compatible — check fires on first page, 
same as old `firstPage` behavior). Users can opt in to a higher threshold to 
delay the check.
   
   4. **Null-heavy columns:** with default 0 the check fires on the first page 
regardless of null ratio, same as before. With a higher threshold, nulls don't 
contribute to `cumulativeRawBytes` (they're in definition levels), so the 
threshold takes longer to reach — but the check still eventually fires once 
enough non-null values accumulate. For all-null columns with the default of 0, 
the check fires immediately since `cumulativeRawBytes >= 0` is trivially true.
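
The two-counter arrangement described in points 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class `FallbackGateSketch` and the methods `write`, `shouldCheck`, and `compressionHelps` are hypothetical names, and only `rawDataByteSize`, `cumulativeRawBytes`, `reset()`, and `resetDictionary()` come from the comment above. The comparison in `compressionHelps` is a simplified assumption about what the raw-versus-encoded check looks like.

```java
/** Hypothetical stand-in for the writer state; not the parquet-java class. */
final class FallbackGateSketch {
    private final long checkAfterRawBytes; // threshold; 0 means "check on the first page"
    private long rawDataByteSize;          // per-page counter, cleared by reset()
    private long cumulativeRawBytes;       // per-column-chunk counter, cleared by resetDictionary()

    FallbackGateSketch(long checkAfterRawBytes) {
        this.checkAfterRawBytes = checkAfterRawBytes;
    }

    /** Records one non-null value. Nulls never reach this method, so they add nothing. */
    void write(int valueBytes) {
        rawDataByteSize += valueBytes;
        cumulativeRawBytes += valueBytes;
    }

    /** Threshold gate: uses the counter that accumulates across pages. */
    boolean shouldCheck() {
        return cumulativeRawBytes >= checkAfterRawBytes;
    }

    /** Assumed comparison: the current page's raw size against its encoded size. */
    boolean compressionHelps(long encodedSize) {
        return encodedSize < rawDataByteSize;
    }

    /** End of page: only the per-page counter is cleared. */
    void reset() {
        rawDataByteSize = 0;
    }

    /** End of column chunk: the cumulative counter starts over. */
    void resetDictionary() {
        cumulativeRawBytes = 0;
    }

    long rawDataByteSize() { return rawDataByteSize; }
    long cumulativeRawBytes() { return cumulativeRawBytes; }

    public static void main(String[] args) {
        FallbackGateSketch gate = new FallbackGateSketch(100);
        gate.write(60);                         // page 1: 60 raw bytes
        System.out.println(gate.shouldCheck()); // false, 60 < 100
        gate.reset();                           // page boundary
        gate.write(60);                         // page 2: cumulative is now 120
        System.out.println(gate.shouldCheck()); // true, 120 >= 100
    }
}
```

With a threshold of 0, `shouldCheck()` returns true before any value is written, which matches the default and all-null cases in point 4.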
   
   Let me know if this direction works or if you'd prefer a different approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

