timarmstrong commented on pull request #33639: URL: https://github.com/apache/spark/pull/33639#issuecomment-942487858
Hi @huaxingao @sunchao. I saw this change go past and I have concerns about this not gracefully handling missing stats - it's a bad user experience if they need to turn off a very effective optimisation globally because of a single empty stat column in a (valid!) Parquet file. > If somehow the file doesn't have stats for the aggregate columns, Spark will throw Exception. There are many reasons why a file might not have min/max stats: * There's no requirement in the Parquet spec that writers have to include them, it's optional. * The min_value/max_value fields were only added in Parquet format 2.4.0/parquet-mr 1.9.0: https://issues.apache.org/jira/browse/PARQUET-686 * Readers might have to ignore min/max stats if there are buggy writers that produce invalid min/max values. This has happened historically. * Writers can choose not to include stats for various reasons, e.g. strings are too long. Some workloads do work with very large string data so it's not always practical to include as min/max stats. I've looked at implementing a similar optimisation to this before and I think in order for it to be robust, the reader has to fall back to scanning the column and computing min/max if the stats are not available or useable. Otherwise it's not something that's really robust or production-ready. I also think Ivan's concern is valid and there is a risk of incorrect results. The Parquet spec I think is unclear about whether truncation of min/max stats at the row group level is allowed. When the page index support was added, it was clarified that truncation *is* allowed in the page index, but it didn't address the row group level min/max stats - https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach. I don't know if writers do truncate in practice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
