wgtmac commented on PR #3556: URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4619402500
Thanks for the quick update. Before diving into the code detail, I have some additional questions: - `parquet.dictionary.check.after.bytes` is confusing. Does "bytes" refer to raw (uncompressed) bytes or compressed/encoded bytes? Please rename it to be explicit (e.g., parquet.dictionary.check.after.raw.bytes) and update the documentation accordingly. - In FallbackValuesWriter, rawDataByteSize is reset to 0 at the end of every page via reset(). With modern configurations (e.g., 20k rows per page), a single page's size will almost never reach the 1MB threshold. This effectively disables the compression check entirely. Did you intend for this data size to accumulate across pages, or is permanently bypassing the check for small pages your actual design? - Since nulls are encoded in definition levels and do not contribute to rawDataByteSize, a page with a high null ratio will have an extremely small raw data size. How does this algorithm account for heavily null-populated columns? Will it also indefinitely bypass the check? - It would be good to make the default behavior unchanged since many downstream environments (e.g. Apache Spark) unfortunately check produced file metadata (including encoding, file size, etc.) in their integration tests so behavior change may break them. cc @gszadovszky @Fokko for additional advice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
