[GitHub] [spark] huaxingao opened a new pull request #33650: [SPARK-36351][SQL] Separate partition filters and data filters in PushDownUtils

GitBox Thu, 05 Aug 2021 01:29:18 -0700


huaxingao opened a new pull request #33650:
URL: https://github.com/apache/spark/pull/33650



   
   
   ### What changes were proposed in this pull request?
   In file-based data sources, there are two sets of filters: partition filters 
and data filters. For example, suppose table test has columns c1 and c2, and p 
is the partition column. For the query
   
   `SELECT COUNT(c1) FROM test WHERE p = 0 AND c2 = 8`
   
   p = 0 is the partition filter and c2 = 8 is the data filter.
   
   With a SQL-based data source, pushing down an aggregate together with a 
filter to the underlying data source is not a problem, because the data source 
processes the query itself. A file-based data source does not process queries, 
so we cannot push down an aggregate with a filter. For aggregate 
(Min/Max/Count) push down, we simply take advantage of the statistics in the 
data source. For example, in Parquet we read Min/Max/Count from the footer. A 
simple aggregate such as `SELECT COUNT(c1) FROM test` can be pushed down 
because the footer has statistics about the total number of values in column 
c1. Once a filter is involved, as in `SELECT COUNT(c1) FROM test WHERE c2=8`, 
we can no longer push down: the footer only has the total count for column c1 
and does not know how many of those rows satisfy c2=8. So in the aggregate 
push down checking logic for file-based data sources, I simply block aggregate 
push down if there is a filter involved. However, we can lift the restriction 
if the filter is only a partition filter, because we can simply prune the 
unneeded partitions. For example, for `SELECT COUNT(c1) FROM test WHERE p = 0`, 
we can prune the partitions with p!=0; for the remaining partition p=0, we can 
run the query `SELECT COUNT(c1) FROM test` and still push down the aggregate 
to the data source.
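
   The reasoning above can be illustrated with a small, self-contained Python 
model. This is only a sketch and not Spark code: `FileFooter`, 
`count_from_footers`, and the sample numbers are hypothetical, and it assumes 
each file's footer records a row count and per-column null counts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileFooter:
    """Hypothetical stand-in for one data file's footer statistics."""
    partition_value: int         # value of partition column p for this file
    num_rows: int                # total rows recorded in the footer
    null_counts: Dict[str, int]  # per-column null counts recorded in the footer


def count_from_footers(files: List[FileFooter], column: str,
                       partition_value: Optional[int] = None,
                       has_data_filter: bool = False) -> Optional[int]:
    """Answer COUNT(column) from footers, or return None if that is impossible."""
    if has_data_filter:
        # Footers do not say how many rows satisfy a predicate such as c2 = 8.
        return None
    # Partition pruning: keep only the files in the requested partition.
    kept = [f for f in files
            if partition_value is None or f.partition_value == partition_value]
    # COUNT(column) counts non-null values: rows minus nulls, summed per file.
    return sum(f.num_rows - f.null_counts.get(column, 0) for f in kept)


# Made-up sample data: two files in partition p=0 and one in p=1.
files = [
    FileFooter(partition_value=0, num_rows=100, null_counts={"c1": 5}),
    FileFooter(partition_value=0, num_rows=50, null_counts={"c1": 0}),
    FileFooter(partition_value=1, num_rows=200, null_counts={"c1": 10}),
]

print(count_from_footers(files, "c1"))                        # 335
print(count_from_footers(files, "c1", partition_value=0))     # 145
print(count_from_footers(files, "c1", has_data_filter=True))  # None
```

   In this model, the partition filter only changes which footers are read, 
while a data filter makes the footers insufficient, so the function gives up.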
   
   In order to lift the above restriction, at the time of checking whether to 
push down the aggregate, we should separate the partition filters and data 
filters, so that we can push down the aggregate if there are only partition 
filters.
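
   A minimal sketch of this separation, in plain Python rather than the actual 
PushDownUtils code. The `Filter` class, `split_filters`, and 
`can_push_down_aggregate` are hypothetical names, and the sketch assumes a 
filter counts as a partition filter when every column it references is a 
partition column.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: a SQL snippet plus the columns it references."""
    sql: str
    references: FrozenSet[str]


def split_filters(filters: List[Filter], partition_columns: Set[str]
                  ) -> Tuple[List[Filter], List[Filter]]:
    """Split filters into (partition_filters, data_filters)."""
    partition_filters: List[Filter] = []
    data_filters: List[Filter] = []
    for f in filters:
        # Assumption: a filter touching only partition columns can be
        # satisfied by pruning partitions; anything else needs row data.
        if f.references and f.references <= partition_columns:
            partition_filters.append(f)
        else:
            data_filters.append(f)
    return partition_filters, data_filters


def can_push_down_aggregate(filters: List[Filter],
                            partition_columns: Set[str]) -> bool:
    """Allow aggregate push down only when no data filter remains."""
    _, data_filters = split_filters(filters, partition_columns)
    return not data_filters


p_eq_0 = Filter("p = 0", frozenset({"p"}))
c2_eq_8 = Filter("c2 = 8", frozenset({"c2"}))

print(can_push_down_aggregate([p_eq_0], {"p"}))           # True
print(can_push_down_aggregate([p_eq_0, c2_eq_8], {"p"}))  # False
```

   For the example query with `p = 0 AND c2 = 8`, the split yields one 
partition filter and one data filter, so the aggregate is not pushed down; 
with `p = 0` alone, it is.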
   
   
   ### Why are the changes needed?
   Separate the partition filters and data filters in PushDownUtils, so that 
we can push down the aggregate if there are only partition filters.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing tests
   




