yui2010 commented on pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#issuecomment-751559719


   Hi @garyli1019, sorry for the late reply.
   1. About partition pruning: it skips data that the query does not need. For example,
      given the following partitions:

          /hudi_ws/order/dt=20200801
          /hudi_ws/order/dt=20200802
          ... ...
          /hudi_ws/order/dt=20200831

      a query such as `select * from order where dt > '20200820'` currently starts
      31 tasks, although the 20 tasks for the partitions in {20200801,...,20200820}
      do not need to run. If we support partition pruning (currently implemented in
      Spark by 1. the built-in FileSourceStrategy and 2. Spark 3 DataSource V2,
      [https://issues.apache.org/jira/browse/SPARK-30428](https://issues.apache.org/jira/browse/SPARK-30428)),
      the unneeded partitions are skipped and only the 11 tasks for the partitions in
      {20200821,...,20200831} run, which makes the query more efficient.
       
   2. About "CatalystScan rather than PrunedFilteredScan":
      This is just for the convenience of using the `partitionFilters` parameter of
      the `listFiles` method:
      `PartitioningAwareFileIndex#listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression])`

      CatalystScan is marked 'Experimental', as you mentioned.
      CatalystScan and PrunedFilteredScan are both planned through DataSourceStrategy.
      I looked into it, and it has no special logic for either of them; it just passes
      them different parameters. There is also an existing use case
      (`https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/src/main/scala/org/apache/spark/sql/hbase/HBaseRelation.scala`).

      I also struggled with the following points while implementing this feature:
      1. FileSourceStrategy computes partitionKeyFilters, but DataSourceStrategy does not;
      2. whether to use CatalystScan or PrunedFilteredScan;
      3. how to split allPredicates into partition filters and data filters.
      Do you have any suggestions on supporting partition pruning?
   

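For the third open question in point 2 (splitting allPredicates into partition filters and data filters), one possible rule is to treat a predicate as a partition filter only when every column it references is a partition column. The sketch below models that rule in plain Scala. The `Predicate` case class is a hypothetical stand-in for a Catalyst `Expression`, and the columns `amount` and `order_date` are invented for the example; this is an assumption about how the split could work, not the code in this PR.

```scala
// Simplified model of splitting predicates; not Hudi or Spark code.
object FilterSplitSketch {
  // Hypothetical stand-in for a Catalyst expression: only the SQL text
  // and the names of the referenced columns are modelled.
  final case class Predicate(sql: String, references: Set[String])

  // Returns (partitionFilters, dataFilters). A predicate is a partition
  // filter only if all columns it references are partition columns.
  def split(
      allPredicates: Seq[Predicate],
      partitionColumns: Set[String]): (Seq[Predicate], Seq[Predicate]) =
    allPredicates.partition(_.references.subsetOf(partitionColumns))

  // Example predicates; `amount` and `order_date` are made-up columns.
  val allPredicates: Seq[Predicate] = Seq(
    Predicate("dt > '20200820'", Set("dt")),
    Predicate("amount > 100", Set("amount")),
    Predicate("dt = order_date", Set("dt", "order_date"))
  )

  val (partitionFilters, dataFilters) = split(allPredicates, Set("dt"))
}
```

A predicate that mixes partition and data columns, such as `dt = order_date`, stays a data filter here because it cannot be evaluated from the partition path alone.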
