singhpk234 opened a new issue #4188: URL: https://github.com/apache/iceberg/issues/4188
At present, Iceberg only supports runtime filtering on partition columns with **Apache Spark**; this issue proposes how we can achieve runtime filtering on data columns as well.

Say a runtime filter on a data column is injected into Iceberg by Spark. At present we will simply evaluate it against the partition columns, and since the filter is on a data column, a `TrueLiteral` will be returned. What we can do instead: we already have DataFile APIs to get the lower / upper bounds per column, and we can use these to evaluate the runtime filter and prune `FileScanTask`s, since at the place where we apply runtime filtering we already have these scan tasks and, within them, the data files (see the first sketch at the end of this issue).

```
Let's say we get a filter like:
- dataColumn IN (1, 21, ..., 100)

dataColumn range from DataFile1: (20, 70)

we can create filters such as:
- (1 >= 20 && 1 <= 70) || (21 >= 20 && 21 <= 70) || (100 >= 20 && 100 <= 70)
```

We can even do one further optimization, as Parquet does: when the number of literals within an IN subquery is above a certain threshold, it converts the IN into a range query (ref: [SPARK-32792](https://issues.apache.org/jira/browse/SPARK-32792)). Note: a large IN set can lead to performance degradation when the filter is evaluated at the data source level. Once the filter is a range, a file survives pruning only if its bounds overlap the query range (a sketch of this rewrite also follows at the end):

```
- dataColumn IN (1, 21, 22, 25, 30, 35, ..., 100)

when the IN set size > certain threshold, converted to:
- dataColumn >= 1 && dataColumn <= 100

dataColumn range from DataFile1: (20, 70)

we can create a filter such as:
- (20 <= 100 && 70 >= 1)
```

Now, to achieve this there is a prerequisite: Spark **should add dynamic filters on data columns** so that they get percolated through the DSv2 code path. There have been past attempts in Spark to achieve the same, ref: [pull/30146](https://github.com/apache/spark/pull/30146).

Pros:
* We will seamlessly support dynamic pruning on data columns when it is introduced in Spark
* It will be good to have for non-partitioned tables

Cons:
* The added code will be dead code until this functionality is introduced in Spark

References:

[1] Trino: [[CodePointer](https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java#L213-L220)]
[2] SparkDataFile APIs: [[CodePointer](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java#L145-L160)]
[3] Impala: https://issues.apache.org/jira/browse/IMPALA-3654

cc @rdblue @aokolnychyi @RussellSpitzer @jackye1995
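
For illustration, here is a minimal sketch of the bounds-based pruning described above, assuming we can reuse Iceberg's existing `InclusiveMetricsEvaluator`, which already evaluates an expression against a data file's per-column lower/upper bounds and null counts. The `runtimeFilter` expression and the way the `FileScanTask` list is obtained are placeholders, not the actual DSv2 integration point:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.expressions.InclusiveMetricsEvaluator;

public class RuntimeFilterSketch {

  // Keep only the scan tasks whose data file might contain matching rows.
  // InclusiveMetricsEvaluator returns false only when the file's column bounds
  // prove no row can match, e.g. when every literal of an IN filter falls
  // outside the file's (lower, upper) range.
  static List<FileScanTask> pruneTasks(Schema schema, Expression runtimeFilter,
                                       List<FileScanTask> tasks) {
    InclusiveMetricsEvaluator evaluator = new InclusiveMetricsEvaluator(schema, runtimeFilter);
    return tasks.stream()
        .filter(task -> evaluator.eval(task.file()))
        .collect(Collectors.toList());
  }

  // Example from above: dataColumn IN (1, 21, ..., 100) keeps a file with
  // bounds (20, 70) because 21 lies within the bounds.
  static Expression exampleFilter() {
    return Expressions.in("dataColumn", 1, 21, 100);
  }
}
```

Trino's `IcebergSplitSource` (reference [1] above) performs essentially this bounds check when pruning splits against dynamic filters.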
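
And a sketch of the IN-to-range rewrite along the lines of SPARK-32792; the threshold constant is an arbitrary placeholder, and the `Comparable` bound on the values is an assumption for the sake of the example:

```java
import java.util.Collections;
import java.util.List;

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class InToRangeSketch {

  // Hypothetical threshold; a real value would need benchmarking.
  static final int IN_SET_THRESHOLD = 20;

  // Rewrite column IN (v1, ..., vn) as column >= min && column <= max once the
  // literal set grows past the threshold. The rewrite is conservative: the range
  // covers every literal, so it may keep extra files but never drops a file
  // that a literal would have matched.
  static <T extends Comparable<T>> Expression rewrite(String column, List<T> values) {
    if (values.size() <= IN_SET_THRESHOLD) {
      return Expressions.in(column, values);
    }

    T min = Collections.min(values);
    T max = Collections.max(values);
    return Expressions.and(
        Expressions.greaterThanOrEqual(column, min),
        Expressions.lessThanOrEqual(column, max));
  }
}
```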
