huaxingao commented on pull request #33639: URL: https://github.com/apache/spark/pull/33639#issuecomment-943090015
Thanks @sadikovi and @timarmstrong for taking a look at this PR! I agree that it would be ideal if we could fall back, but unfortunately there isn't an easy way to do so. We decide whether to push down the aggregate during the logical plan optimization phase, on the driver. If we decide to push it down, the query plan and the schema in the scan are changed, for example from `RelationV2[_1#9, _2#10, _3#11]` to `RelationV2[min(_1)#24, max(_2)#25, count(_3)#26L]`. By the time we read the Parquet footer and find that the stats are not available, we are already on the executors, and there is no way to go back and change the query plan to re-execute. Since I couldn't find a way to fall back, I followed Presto's solution of throwing an exception (https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/AggregatedParquetPageSource.java#L172).

The only alternative that might be possible is, when the stats are not available, to do a full scan of that part of the data in the Parquet reader and compute the min/max ourselves. I am not sure how much work that would involve; I will take a closer look tomorrow.
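To make the two options concrete, here is a minimal sketch of the executor-side decision. This is illustrative only: `ColumnStats`, `minFromStats`, and `minWithFallback` are hypothetical names, not Spark or Parquet APIs. `minFromStats` mirrors the Presto behavior linked above (throw when footer stats are missing), while `minWithFallback` sketches the alternative of scanning the raw values ourselves.

```java
import java.util.OptionalLong;
import java.util.stream.LongStream;

// Hypothetical sketch: by the time an executor discovers that a Parquet
// footer lacks statistics, the driver-side plan rewrite has already
// happened, so the only choices are to fail or to compute the aggregate
// from the raw data.
final class AggregatePushdownSketch {

    // Footer statistics for one column; empty when the writer did not record them.
    record ColumnStats(OptionalLong min, OptionalLong max) {}

    // Mirrors Presto's behavior: fail fast when footer stats are missing,
    // because the executor can no longer change the query plan.
    static long minFromStats(ColumnStats stats) {
        return stats.min().orElseThrow(() ->
            new UnsupportedOperationException(
                "Parquet footer has no statistics for this column; " +
                "cannot answer the pushed-down MIN aggregate"));
    }

    // The possible alternative discussed above: fall back to scanning the
    // row group's raw values and computing the minimum ourselves.
    static long minWithFallback(ColumnStats stats, LongStream rawValues) {
        return stats.min().orElseGet(() -> rawValues.min().orElseThrow());
    }
}
```

The fallback variant keeps the pushed-down schema intact (the scan still produces `min(...)`/`max(...)` columns), which is why it can be done entirely inside the reader without touching the plan.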
