[ https://issues.apache.org/jira/browse/SPARK-37172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437135#comment-17437135 ]
Chungmin commented on SPARK-37172:
----------------------------------

I can work on this if the rationale seems okay.

> Push down filters having both partitioning and non-partitioning columns
> -----------------------------------------------------------------------
>
>                 Key: SPARK-37172
>                 URL: https://issues.apache.org/jira/browse/SPARK-37172
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chungmin
>            Priority: Major
>
> Currently, filters having both partitioning and non-partitioning columns are
> lost during the creation of {{FileSourceScanExec}} and not pushed down to the
> data source. However, theoretically and practically, there is no reason to
> exclude such filters from {{dataFilters}}. For any partitioned source data
> file, the values of partitioning columns are the same for all rows. They can
> be stored physically (or reconstructed logically) along with statistics for
> non-partitioning columns to allow more powerful data skipping. If a data
> source doesn't know how to handle such filters, it can simply ignore such
> filters.
> Example: Suppose that there is a table {{MYTAB}} with two columns {{A}} and
> {{B}}, partitioned by {{A}} (Hive partitioning). Currently, data skipping
> cannot be applied to queries like {{select * from MYTAB where A < B + 7}}
> because {{A < B + 7}} is included in neither {{partitionFilters}} nor
> {{dataFilters}}. However, we could have included the filter in
> {{dataFilters}} because data sources have no obligation to use
> {{dataFilters}} and they could have ignored filters that they cannot use.
> It's not obvious whether we can change the semantics of
> {{FileSourceScanExec.dataFilters}} without breaking existing code. It is
> passed to {{FileIndex.listFiles}} and
> {{FileFormat.buildReaderWithPartitionValues}} and the contracts for the
> methods are not clear enough.
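To make the {{MYTAB}} example concrete, here is a minimal, self-contained sketch of the split described above. It is a simplified model, not Spark's actual planner code: {{Pred}}, {{FilterSplit}} and its method names are hypothetical stand-ins, and the two extra filters ({{A = 1}}, {{B > 3}}) are added only for contrast.

```scala
// Simplified stand-in for a Catalyst predicate (hypothetical type, not a Spark class):
// just the SQL text and the set of column names it references.
final case class Pred(sql: String, references: Set[String])

object FilterSplit {
  // Filters that reference partitioning columns only.
  def partitionFilters(filters: Seq[Pred], partitionCols: Set[String]): Seq[Pred] =
    filters.filter(_.references.subsetOf(partitionCols))

  // Behaviour described in the issue: any reference to a partitioning column
  // keeps the filter out of dataFilters.
  def dataFiltersToday(filters: Seq[Pred], partitionCols: Set[String]): Seq[Pred] =
    filters.filter(_.references.intersect(partitionCols).isEmpty)

  // Proposed behaviour: keep every filter that references at least one
  // non-partitioning column, even if it also references partitioning columns.
  def dataFiltersProposed(filters: Seq[Pred], partitionCols: Set[String]): Seq[Pred] =
    filters.filter(f => !f.references.subsetOf(partitionCols))

  // MYTAB(A, B), partitioned by A.
  val partitionCols: Set[String] = Set("A")
  val filters: Seq[Pred] = Seq(
    Pred("A = 1", Set("A")),          // illustrative, partition-only
    Pred("B > 3", Set("B")),          // illustrative, data-only
    Pred("A < B + 7", Set("A", "B"))  // the mixed filter from the example
  )
}
```

With these inputs, {{dataFiltersToday}} keeps only {{B > 3}}, while {{dataFiltersProposed}} keeps both {{B > 3}} and {{A < B + 7}}; {{A = 1}} stays a partition filter in either case.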
> If we should not change {{dataFilters}}, we might have to add a new member
> variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}})
> and add an overload of {{listFiles}} to the {{FileIndex}} trait, which
> defaults to the existing {{listFiles}} without using the filters. Both
> {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional;
> implementations can ignore the filters if they can't utilize them.
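The overload idea could look roughly like the sketch below. This is only an illustration under stated assumptions: {{Expr}}, {{PartitionDir}}, {{FileIndexSketch}}, {{LegacyIndex}} and {{SkippingIndex}} are hypothetical stand-ins so that the snippet compiles without Spark, and the {{canSkip}} callback is an invented placeholder for whatever statistics a real index might consult.

```scala
// Hypothetical stand-ins for Catalyst's Expression and Spark's PartitionDirectory.
final case class Expr(sql: String)
final case class PartitionDir(path: String)

trait FileIndexSketch {
  // Models the existing two-argument method.
  def listFiles(partitionFilters: Seq[Expr], dataFilters: Seq[Expr]): Seq[PartitionDir]

  // Proposed overload: by default it ignores the extra filters and delegates,
  // so existing implementations keep their current behaviour.
  def listFiles(
      partitionFilters: Seq[Expr],
      dataFilters: Seq[Expr],
      dataFiltersWithPartitionColumns: Seq[Expr]): Seq[PartitionDir] =
    listFiles(partitionFilters, dataFilters)
}

// An index that only knows the old method; it inherits the default overload.
object LegacyIndex extends FileIndexSketch {
  def listFiles(partitionFilters: Seq[Expr], dataFilters: Seq[Expr]): Seq[PartitionDir] =
    Seq(PartitionDir("A=1/part-0"), PartitionDir("A=2/part-0"))
}

// An index that opts in: it drops every directory that some mixed filter rules out.
final class SkippingIndex(
    all: Seq[PartitionDir],
    canSkip: (PartitionDir, Expr) => Boolean) extends FileIndexSketch {

  def listFiles(partitionFilters: Seq[Expr], dataFilters: Seq[Expr]): Seq[PartitionDir] = all

  override def listFiles(
      partitionFilters: Seq[Expr],
      dataFilters: Seq[Expr],
      dataFiltersWithPartitionColumns: Seq[Expr]): Seq[PartitionDir] =
    all.filterNot(dir => dataFiltersWithPartitionColumns.exists(f => canSkip(dir, f)))
}
```

Because the default delegates to the two-argument method, calling the new overload on {{LegacyIndex}} returns the same listing as before, which is the compatibility property the proposal relies on.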