huaxingao commented on pull request #33639:
URL: https://github.com/apache/spark/pull/33639#issuecomment-943090015


   Thanks @sadikovi and @timarmstrong for taking a look at this PR!
   
   I agree that it would be ideal if we could find a way to fall back, but 
unfortunately there isn't an easy way to do so. We decide whether to push down 
the aggregate during the logical-plan optimization phase on the driver. If we 
decide to push it down, the query plan and the scan's schema change, for 
example from `RelationV2[_1#9, _2#10, _3#11]` to 
`RelationV2[min(_1)#24, max(_2)#25, count(_3)#26L]`. By the time we read the 
Parquet footer and discover that the statistics are not available, we are 
already on the executors, and there is no way to go back and change the query 
plan for re-execution. Since I couldn't find a way to fall back, I followed 
Presto's solution of throwing an exception 
(https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/AggregatedParquetPageSource.java#L172).
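   To make the dilemma concrete, here is a minimal, hypothetical Java sketch of the executor-side situation. The class, record, and method names are illustrative only (not Spark's or Presto's actual APIs): once the plan expects aggregate output columns, an executor that finds the footer statistics missing can only fail, mirroring Presto's approach.

```java
import java.util.Map;
import java.util.Optional;

public class AggregatePushdownSketch {
    // Stand-in for per-row-group column statistics decoded from a Parquet footer.
    record ColumnStats(Optional<Long> min, Optional<Long> max) {}

    // Hypothetical helper: answer MIN(column) from footer statistics alone.
    static long readMin(String column, Map<String, ColumnStats> footerStats) {
        ColumnStats stats = footerStats.get(column);
        if (stats == null || stats.min().isEmpty()) {
            // We are on an executor; the driver-side plan already replaced the
            // scan schema with aggregate columns, so there is no path back to
            // a full scan. Throwing is the only safe option (as Presto does).
            throw new UnsupportedOperationException(
                "Statistics missing for column " + column
                    + "; cannot compute MIN from the Parquet footer");
        }
        return stats.min().get();
    }

    public static void main(String[] args) {
        Map<String, ColumnStats> footer = Map.of(
            "_1", new ColumnStats(Optional.of(3L), Optional.of(42L)),
            "_2", new ColumnStats(Optional.empty(), Optional.empty()));
        System.out.println(readMin("_1", footer));  // stats present -> 3
        try {
            readMin("_2", footer);  // stats missing -> throws
        } catch (UnsupportedOperationException e) {
            System.out.println("fallback impossible: " + e.getMessage());
        }
    }
}
```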
   
   The only approach that might work is this: if the statistics are not 
available, do a full scan of that portion of the data in the Parquet reader 
and compute max/min ourselves. I am not sure how much work that would be; I 
will take a closer look tomorrow.
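   The fallback idea above amounts to a single pass over the decoded values when footer statistics are absent. A minimal sketch (the method name and shape are assumptions for illustration, not Spark's reader API):

```java
import java.util.List;

public class ScanFallbackSketch {
    // Compute min and max in one pass over decoded values, as a reader would
    // have to when the footer carries no statistics for the column.
    static long[] minMaxByScan(List<Long> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values to aggregate");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        long[] result = minMaxByScan(List.of(7L, -2L, 15L, 4L));
        System.out.println(result[0] + " " + result[1]);  // -2 15
    }
}
```

The cost is one decode-and-compare pass per row group, which is exactly the work that aggregate pushdown was meant to avoid, so this would only fire for the row groups whose statistics are missing.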
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


