[ https://issues.apache.org/jira/browse/SPARK-37172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437135#comment-17437135 ]
Chungmin commented on SPARK-37172:
----------------------------------
I can work on this if the rationale seems okay.
> Push down filters having both partitioning and non-partitioning columns
> -----------------------------------------------------------------------
>
> Key: SPARK-37172
> URL: https://issues.apache.org/jira/browse/SPARK-37172
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Chungmin
> Priority: Major
>
> Currently, filters that reference both partitioning and non-partitioning
> columns are dropped during the creation of {{FileSourceScanExec}} and are
> never pushed down to the data source. However, there is no theoretical or
> practical reason to exclude such filters from {{dataFilters}}. Within any
> single data file of a partitioned source, the partitioning columns have the
> same values for all rows. Those values can be stored physically (or
> reconstructed logically) alongside the statistics for non-partitioning
> columns to allow more powerful data skipping. A data source that doesn't
> know how to handle such filters can simply ignore them.
> Example: Suppose that there is a table {{MYTAB}} with two columns {{A}} and
> {{B}}, partitioned by {{A}} (Hive partitioning). Currently, data skipping
> cannot be applied to queries like {{select * from MYTAB where A < B + 7}}
> because {{A < B + 7}} is included in neither {{partitionFilters}} nor
> {{dataFilters}}. However, the filter could be included in {{dataFilters}},
> because data sources have no obligation to use {{dataFilters}} and can
> ignore filters that they cannot use.
> It's not obvious whether we can change the semantics of
> {{FileSourceScanExec.dataFilters}} without breaking existing code. It is
> passed to {{FileIndex.listFiles}} and
> {{FileFormat.buildReaderWithPartitionValues}}, and the contracts of those
> methods are not clear enough.
> If we should not change {{dataFilters}}, we might have to add a new member
> variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}})
> and add an overload of {{listFiles}} to the {{FileIndex}} trait, which
> defaults to the existing {{listFiles}} without using the filters. Both
> {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional;
> implementations can ignore the filters if they can't utilize them.
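
To make the description above concrete, here is a rough Python model of the idea. It is not Spark code: `Filter`, `split_filters`, the `FileIndex` class, and the method name `list_files_with_mixed` are all hypothetical stand-ins invented for illustration (Python has no overloads, so the proposed {{listFiles}} overload is modeled as a separately named method). The sketch assumes the split works as the issue describes: a filter goes to the partition filters if it references only partitioning columns, to the data filters if it references none, and is otherwise left out of both.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for a predicate: SQL text plus referenced columns."""
    sql: str
    references: FrozenSet[str]


def split_filters(
    filters: Sequence[Filter], partition_cols: FrozenSet[str]
) -> Tuple[List[Filter], List[Filter], List[Filter]]:
    """Return (partition_filters, data_filters, mixed_filters).

    mixed_filters are the filters that, per the issue, currently end up
    in neither of the first two lists.
    """
    partition_filters, data_filters, mixed_filters = [], [], []
    for f in filters:
        if f.references <= partition_cols:
            partition_filters.append(f)      # only partitioning columns
        elif f.references.isdisjoint(partition_cols):
            data_filters.append(f)           # only non-partitioning columns
        else:
            mixed_filters.append(f)          # both kinds of columns
    return partition_filters, data_filters, mixed_filters


class FileIndex:
    """Hypothetical model of the FileIndex trait."""

    def list_files(self, partition_filters, data_filters) -> List[str]:
        raise NotImplementedError

    def list_files_with_mixed(
        self, partition_filters, data_filters, mixed_filters
    ) -> List[str]:
        # Default mirrors the proposal: fall back to the existing method
        # and ignore the extra filters.
        return self.list_files(partition_filters, data_filters)


class LegacyIndex(FileIndex):
    """An existing implementation that knows nothing about mixed filters."""

    def __init__(self, files: Sequence[str]):
        self._files = list(files)

    def list_files(self, partition_filters, data_filters) -> List[str]:
        return list(self._files)


filters = [
    Filter("A = 1", frozenset({"A"})),
    Filter("B > 10", frozenset({"B"})),
    Filter("A < B + 7", frozenset({"A", "B"})),
]
partition_filters, data_filters, mixed_filters = split_filters(
    filters, frozenset({"A"})
)
print([f.sql for f in mixed_filters])  # ['A < B + 7']

index = LegacyIndex(["A=1/part-0.parquet"])
print(index.list_files_with_mixed(partition_filters, data_filters, mixed_filters))
```

In this model, `LegacyIndex` keeps working unchanged because the default implementation discards the mixed filters, while an index that can exploit them would override `list_files_with_mixed`.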
--
This message was sent by Atlassian Jira
(v8.3.4#803005)