huaxingao opened a new pull request #33650: URL: https://github.com/apache/spark/pull/33650
### What changes were proposed in this pull request?

In a file-based data source, there are two sets of filters: partition filters and data filters. For example, suppose table `test` has columns `c1` and `c2`, and `p` is the partition column. For the query `SELECT COUNT(c1) FROM test WHERE p = 0 AND c2 = 8`, `p = 0` is the partition filter and `c2 = 8` is the data filter.

With a SQL-based data source, pushing down an aggregate together with a filter is not a problem, because the data source processes the query itself. A file-based data source, however, does not process queries, so we can't push down an aggregate with a filter. For aggregate (Min/Max/Count) push down, we simply take advantage of the statistics stored in the data source. For example, in Parquet, we read Min/Max/Count from the footer. For a simple aggregate such as `SELECT COUNT(c1) FROM test`, we can push down because the footer has statistics about the total number of values in column `c1`. But if a filter is involved, such as `SELECT COUNT(c1) FROM test WHERE c2 = 8`, we can no longer push down: the footer only has the total count for `c1` and doesn't know how many of those rows satisfy `c2 = 8`. So in the aggregate push down checking logic, for file-based data sources, I simply block aggregate push down if there is a filter involved.

However, we can lift the restriction if the filter is only a partition filter, because we can simply prune the unneeded partitions. For example, for `SELECT COUNT(c1) FROM test WHERE p = 0`, we can prune the partitions with `p != 0`; for the remaining partition `p = 0`, the query is effectively `SELECT COUNT(c1) FROM test`, and the aggregate can still be pushed down to the data source.

To lift this restriction, at the time of checking whether to push down the aggregate, we should separate the partition filters from the data filters, so that we can push down the aggregate if there are only partition filters.

### Why are the changes needed?
Separate the partition filters and data filters in `PushDownUtils`, so that we can push down the aggregate if there are only partition filters.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
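
The push-down check described in this pull request can be illustrated with a minimal sketch. This is not the code in `PushDownUtils`: the `Filter` class and the functions `separate_filters` and `can_push_down_aggregate` are hypothetical names invented for illustration, and the sketch assumes a filter is a partition filter only when every column it references is a partition column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: the columns it references plus its SQL text."""
    references: frozenset
    sql: str


def separate_filters(filters, partition_columns):
    """Split filters into (partition_filters, data_filters).

    Assumption: a filter is a partition filter only if it references at
    least one column and all referenced columns are partition columns.
    Anything else is conservatively treated as a data filter.
    """
    partition_filters = []
    data_filters = []
    for f in filters:
        if f.references and f.references <= partition_columns:
            partition_filters.append(f)
        else:
            data_filters.append(f)
    return partition_filters, data_filters


def can_push_down_aggregate(filters, partition_columns):
    """Allow footer-based aggregate push down only when no data filter remains."""
    _, data_filters = separate_filters(filters, partition_columns)
    return not data_filters


# Table `test` from the description: columns c1 and c2, partition column p.
partition_columns = frozenset({"p"})
p_eq_0 = Filter(frozenset({"p"}), "p = 0")
c2_eq_8 = Filter(frozenset({"c2"}), "c2 = 8")

# SELECT COUNT(c1) FROM test
print(can_push_down_aggregate([], partition_columns))                 # True
# SELECT COUNT(c1) FROM test WHERE p = 0
print(can_push_down_aggregate([p_eq_0], partition_columns))           # True
# SELECT COUNT(c1) FROM test WHERE p = 0 AND c2 = 8
print(can_push_down_aggregate([p_eq_0, c2_eq_8], partition_columns))  # False
```

In this sketch, a query with only `p = 0` still qualifies because partition pruning removes the non-matching partitions before any footer statistics are read, while any remaining data filter such as `c2 = 8` blocks the push down.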
