huaxingao commented on pull request #33639: URL: https://github.com/apache/spark/pull/33639#issuecomment-943090015
Thanks @sadikovi and @timarmstrong for taking a look at this PR! I agree that it would be ideal if we could fall back, but unfortunately there isn't an easy way to do so. We decide whether to push down the aggregate during the logical plan optimization phase, on the driver. If we decide to push it down, the query plan and the schema in the scan are changed, for example from `RelationV2[_1#9, _2#10, _3#11]` to `RelationV2[min(_1)#24, max(_2)#25, count(_3)#26L]`. By the time we read the Parquet footer and find that the stats are not available, we are already on the executors, and there is no way to go back and change the query plan to re-execute. Since I couldn't find a way to fall back, I followed Presto's solution of throwing an exception (https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/AggregatedParquetPageSource.java#L172).

The only alternative that might be possible is, when the stats are not available, to do a full scan of that part of the data in the Parquet reader and compute the min/max ourselves. I am not sure how much work that would involve; I will take a closer look tomorrow.
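To make the two options concrete, here is a minimal sketch of the executor-side decision. This is illustrative only: `ColumnStats`, `minFromStats`, and `minWithFallback` are hypothetical names, not Spark or Parquet APIs. `minFromStats` mirrors the Presto behavior linked above (throw when footer stats are missing), while `minWithFallback` sketches the alternative of scanning the raw values ourselves.

```java
import java.util.OptionalLong;
import java.util.stream.LongStream;

// Hypothetical sketch: by the time an executor discovers that a Parquet
// footer lacks statistics, the driver-side plan rewrite has already
// happened, so the only choices are to fail or to compute the aggregate
// from the raw data.
final class AggregatePushdownSketch {

    // Footer statistics for one column; empty when the writer did not record them.
    record ColumnStats(OptionalLong min, OptionalLong max) {}

    // Mirrors Presto's behavior: fail fast when footer stats are missing,
    // because the executor can no longer change the query plan.
    static long minFromStats(ColumnStats stats) {
        return stats.min().orElseThrow(() ->
            new UnsupportedOperationException(
                "Parquet footer has no statistics for this column; " +
                "cannot answer the pushed-down MIN aggregate"));
    }

    // The possible alternative discussed above: fall back to scanning the
    // row group's raw values and computing the minimum ourselves.
    static long minWithFallback(ColumnStats stats, LongStream rawValues) {
        return stats.min().orElseGet(() -> rawValues.min().orElseThrow());
    }
}
```

The fallback variant keeps the pushed-down schema intact (the scan still produces `min(...)`/`max(...)` columns), which is why it can be done entirely inside the reader without touching the plan.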
