[GitHub] [hudi] yui2010 commented on pull request #2378: [HUDI-1491] Support partition pruning for MOR snapshot query

GitBox Sun, 27 Dec 2020 19:17:58 -0800


yui2010 commented on pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#issuecomment-751559719

Hi @garyli1019, sorry for the late reply.
1. About partition pruning: it skips unneeded data. For example,
given the following partitions:
/hudi_ws/order/dt=20200801
/hudi_ws/order/dt=20200802
... ...
/hudi_ws/order/dt=20200831
For a query like `select * from order where dt>'20200820'`, the current
code starts 31 tasks, and the 20 tasks whose partitions are in
{20200801,...,20200820} do not need to run. If we support partition
pruning (currently implemented in Spark by 1. the built-in
FileSourceStrategy and 2. Spark 3 DataSource V2,
[https://issues.apache.org/jira/browse/SPARK-30428]
(https://issues.apache.org/jira/browse/SPARK-30428)), it will skip the
unneeded partition data and run only the 11 tasks whose partitions are in
{20200821,...,20200831}, which is more efficient.
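
To make the task counts in the example concrete, here is a minimal sketch in plain Python. It only models the partition paths as strings and is not Hudi or Spark code; the helper names (`list_partitions`, `partition_value`, `prune`) are made up for illustration.

```python
from datetime import date, timedelta


def list_partitions(start, days):
    """Build paths like /hudi_ws/order/dt=20200801 for consecutive days."""
    return [
        "/hudi_ws/order/dt=" + (start + timedelta(days=i)).strftime("%Y%m%d")
        for i in range(days)
    ]


def partition_value(path):
    """Extract the dt value from a partition path."""
    return path.rsplit("dt=", 1)[1]


def prune(paths, predicate):
    """Keep only the partitions whose dt value satisfies the predicate."""
    return [p for p in paths if predicate(partition_value(p))]


all_partitions = list_partitions(date(2020, 8, 1), 31)

# Same comparison as `where dt>'20200820'`: dt is compared as a string.
kept = prune(all_partitions, lambda dt: dt > "20200820")

print(len(all_partitions))              # 31 tasks without pruning
print(len(kept))                        # 11 tasks with pruning
print(len(all_partitions) - len(kept))  # 20 tasks skipped
```

Without pruning, one task is started per partition and the filter is applied afterwards; with pruning, the filter is applied to the partition list first, so only the matching partitions are scanned.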

2. About using CatalystScan rather than PrunedFilteredScan:
This is only for the convenience of passing the partitionFilters
parameter to the `listFiles` method:
`PartitioningAwareFileIndex#listFiles(partitionFilters:
Seq[Expression], dataFilters: Seq[Expression])`
CatalystScan is marked 'Experimental', as you mentioned.
Both CatalystScan and PrunedFilteredScan are handled through
DataSourceStrategy. I investigated it, and it has no special logic for
either one; they just receive different parameters. There is also an
existing use case
(`https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/src/main/scala/org/apache/spark/sql/hbase/HBaseRelation.scala`).
While implementing this feature I was also unsure about three things:
1. FileSourceStrategy computes partitionKeyFilters but DataSourceStrategy
does not; 2. whether to use CatalystScan or PrunedFilteredScan; 3. how to
split allPredicates into partition filters and data filters. Do you have
any suggestions on supporting partition pruning?
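
For item 3, one possible way to split the predicates could look like the sketch below. It is a simplified model in plain Python, not Spark's or Hudi's actual code: the `Predicate` type, the `split_predicates` helper, and the `amount` column are all hypothetical. The assumed rule is that a predicate counts as a partition filter only when every column it references is a partition column.

```python
from typing import NamedTuple


class Predicate(NamedTuple):
    """Hypothetical stand-in for a filter expression."""
    columns: frozenset  # names of the columns the predicate references
    text: str           # human-readable form, for display only


def split_predicates(predicates, partition_columns):
    """Split predicates into (partition_filters, data_filters).

    A predicate is a partition filter only if it references at least one
    column and all referenced columns are partition columns. Everything
    else, including predicates that mix both kinds of column, stays in
    data_filters and has to be evaluated on the rows themselves.
    """
    partition_filters, data_filters = [], []
    for p in predicates:
        if p.columns and p.columns <= partition_columns:
            partition_filters.append(p)
        else:
            data_filters.append(p)
    return partition_filters, data_filters


predicates = [
    Predicate(frozenset({"dt"}), "dt > '20200820'"),
    Predicate(frozenset({"amount"}), "amount > 100"),  # hypothetical column
    Predicate(frozenset({"dt", "amount"}), "dt = '20200825' OR amount > 100"),
]

partition_filters, data_filters = split_predicates(predicates, frozenset({"dt"}))

print([p.text for p in partition_filters])
print([p.text for p in data_filters])
```

Only the first list would be passed as `partitionFilters` to `listFiles`; the mixed predicate cannot be used to skip partitions safely, so in this sketch it is kept with the row-level filters.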
