[jira] [Commented] (SPARK-37172) Push down filters having both partitioning and non-partitioning columns

Chungmin (Jira) Mon, 01 Nov 2021 21:32:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-37172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437135#comment-17437135
 ]


Chungmin commented on SPARK-37172:
----------------------------------

I can work on this if the rationale seems okay.

> Push down filters having both partitioning and non-partitioning columns
> -----------------------------------------------------------------------
>
>                 Key: SPARK-37172
>                 URL: https://issues.apache.org/jira/browse/SPARK-37172
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chungmin
>            Priority: Major
>
> Currently, filters having both partitioning and non-partitioning columns are 
> lost during the creation of {{FileSourceScanExec}} and not pushed down to the 
> data source. However, theoretically and practically, there is no reason to 
> exclude such filters from {{dataFilters}}. For any partitioned source data 
> file, the values of partitioning columns are the same for all rows. They can 
> be stored physically (or reconstructed logically) along with statistics for 
> non-partitioning columns to allow more powerful data skipping. If a data 
> source doesn't know how to handle such filters, it can simply ignore them.
> Example: Suppose that there is a table {{MYTAB}} with two columns {{A}} and 
> {{B}}, partitioned by {{A}} (Hive partitioning). Currently, data skipping 
> cannot be applied to queries like {{select * from MYTAB where A < B + 7}} 
> because {{A < B + 7}} is included in neither {{partitionFilters}} nor 
> {{dataFilters}}. However, we could have included the filter in 
> {{dataFilters}} because data sources have no obligation to use 
> {{dataFilters}} and they could have ignored filters that they cannot use.
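>
> To make the example concrete, here is a minimal, self-contained sketch of 
> how the split could be modeled. It is a simplified illustration, not 
> Spark's actual planner code: {{Filter}}, {{Split}} and {{split}} are 
> hypothetical names, and each filter is reduced to the set of column names 
> it references.
>
> ```scala
> // Simplified model of the filter split; nothing here is a Spark class.
> object FilterSplitSketch {
>   // A filter is reduced to its SQL text and the columns it references.
>   final case class Filter(sql: String, references: Set[String])
>
>   final case class Split(
>       partitionFilters: Seq[Filter],
>       dataFilters: Seq[Filter],
>       dropped: Seq[Filter])
>
>   def split(filters: Seq[Filter], partitionCols: Set[String]): Split = {
>     // References only partitioning columns: usable for partition pruning.
>     val partitionFilters =
>       filters.filter(_.references.subsetOf(partitionCols))
>     // References no partitioning column: passed on as a data filter.
>     val dataFilters =
>       filters.filter(_.references.intersect(partitionCols).isEmpty)
>     // Everything else mixes both kinds of columns and ends up in neither.
>     val dropped = filters.filterNot(f =>
>       partitionFilters.contains(f) || dataFilters.contains(f))
>     Split(partitionFilters, dataFilters, dropped)
>   }
>
>   def main(args: Array[String]): Unit = {
>     val filters = Seq(
>       Filter("A = 1", Set("A")),
>       Filter("B > 3", Set("B")),
>       Filter("A < B + 7", Set("A", "B")))
>     val result = split(filters, partitionCols = Set("A"))
>     println(result.dropped.map(_.sql)) // List(A < B + 7)
>   }
> }
> ```
>
> In this model {{A < B + 7}} lands in {{dropped}}; the proposal amounts to 
> handing such filters to the data source as well.
>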
> It's not obvious whether we can change the semantics of 
> {{FileSourceScanExec.dataFilters}} without breaking existing code. It is 
> passed to {{FileIndex.listFiles}} and 
> {{FileFormat.buildReaderWithPartitionValues}} and the contracts for the 
> methods are not clear enough.
> If we should not change {{dataFilters}}, we might have to add a new member 
> variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}}) 
> and add an overload of {{listFiles}} to the {{FileIndex}} trait, which 
> defaults to the existing {{listFiles}} without using the filters. Both 
> {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional; 
> implementations can ignore the filters if they can't utilize them.
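>
> A rough sketch of the overload idea, under stated assumptions: {{Expr}} 
> and {{PartitionDir}} are plain-string stand-ins for Spark's expression and 
> partition directory types, the three-argument {{listFiles}} is the 
> hypothetical new overload, and {{LegacyIndex}}, {{SkippingIndex}} and 
> {{canSkip}} are made up for illustration.
>
> ```scala
> // Self-contained model of the proposed overload; not Spark's real trait.
> object ListFilesOverloadSketch {
>   type Expr = String         // stand-in for a Catalyst expression
>   type PartitionDir = String // stand-in for a partition directory
>
>   trait FileIndex {
>     // Shape of the existing method (simplified types).
>     def listFiles(
>         partitionFilters: Seq[Expr],
>         dataFilters: Seq[Expr]): Seq[PartitionDir]
>
>     // Hypothetical overload: by default the extra filters are ignored.
>     def listFiles(
>         partitionFilters: Seq[Expr],
>         dataFilters: Seq[Expr],
>         dataFiltersWithPartitionColumns: Seq[Expr]
>     ): Seq[PartitionDir] =
>       listFiles(partitionFilters, dataFilters)
>   }
>
>   // Existing implementation: unchanged, never sees the mixed filters.
>   class LegacyIndex(dirs: Seq[PartitionDir]) extends FileIndex {
>     def listFiles(p: Seq[Expr], d: Seq[Expr]): Seq[PartitionDir] = dirs
>   }
>
>   // Implementation that opts in and skips directories it can rule out.
>   class SkippingIndex(
>       dirs: Seq[PartitionDir],
>       canSkip: (PartitionDir, Expr) => Boolean) extends FileIndex {
>     def listFiles(p: Seq[Expr], d: Seq[Expr]): Seq[PartitionDir] = dirs
>
>     override def listFiles(
>         p: Seq[Expr],
>         d: Seq[Expr],
>         mixed: Seq[Expr]): Seq[PartitionDir] =
>       dirs.filterNot(dir => mixed.exists(f => canSkip(dir, f)))
>   }
>
>   def main(args: Array[String]): Unit = {
>     val dirs = Seq("A=1", "A=100")
>     val mixed = Seq("A < B + 7")
>     val legacy = new LegacyIndex(dirs)
>     // Toy rule: pretend statistics prove no row under A=100 can match.
>     val skipping = new SkippingIndex(dirs, (dir, _) => dir == "A=100")
>     println(legacy.listFiles(Nil, Nil, mixed))   // List(A=1, A=100)
>     println(skipping.listFiles(Nil, Nil, mixed)) // List(A=1)
>   }
> }
> ```
>
> Because the overload has a default body, implementations written against 
> the two-argument method keep compiling and simply return every directory.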



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
