wgtmac commented on PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4619402500

   Thanks for the quick update. Before diving into the code detail, I have some 
additional questions:
   
   - `parquet.dictionary.check.after.bytes` is confusing. Does "bytes" refer to 
raw (uncompressed) bytes or compressed/encoded bytes? Please rename it to be 
explicit (e.g., parquet.dictionary.check.after.raw.bytes) and update the 
documentation accordingly.
   - In FallbackValuesWriter, rawDataByteSize is reset to 0 at the end of every 
page via reset(). With modern configurations (e.g., 20k rows per page), a 
single page's size will almost never reach the 1MB threshold. This effectively 
disables the compression check entirely. Did you intend for this data size to 
accumulate across pages, or is permanently bypassing the check for small pages 
your actual design?
   - Since nulls are encoded in definition levels and do not contribute to 
rawDataByteSize, a page with a high null ratio will have an extremely small raw 
data size. How does this algorithm account for heavily null-populated columns? 
Will it also indefinitely bypass the check?
   - It would be good to make the default behavior unchanged since many 
downstream environments (e.g. Apache Spark) unfortunately check produced file 
metadata (including encoding, file size, etc.) in their integration tests so 
behavior change may break them.
   
   cc @gszadovszky @Fokko for additional advice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to