singhpk234 opened a new issue #4188: URL: https://github.com/apache/iceberg/issues/4188
At present, Iceberg only supports runtime filtering on partition columns with **Apache Spark**; this issue proposes how we can achieve runtime filtering on data columns as well.

Say a runtime filter on a data column is injected into Iceberg by Spark. At present we will simply evaluate it against the partition columns, and since the filter is on a data column, a `TrueLiteral` will be returned. What we can do instead: we already have DataFile APIs to get the lower / upper bounds per column, and we can use these to evaluate the runtime filter and prune `FileScanTask`s, since at the place where we apply runtime filtering we already have these scan tasks and, within them, the data files (see the first sketch at the end of this issue).

```
Let's say we get a filter like:
- dataColumn IN (1, 21, ..., 100)

dataColumn range from DataFile1: (20, 70)

we can create filters such as:
- (1 >= 20 && 1 <= 70) || (21 >= 20 && 21 <= 70) || (100 >= 20 && 100 <= 70)
```

We can even do one further optimization, as Parquet does: when the number of literals within an IN subquery is above a certain threshold, it converts the IN into a range query (ref: [SPARK-32792](https://issues.apache.org/jira/browse/SPARK-32792)). Note: a large IN set can lead to performance degradation when the filter is evaluated at the data source level. Once the filter is a range, a file survives pruning only if its bounds overlap the query range (a sketch of this rewrite also follows at the end):

```
- dataColumn IN (1, 21, 22, 25, 30, 35, ..., 100)

when the IN set size > certain threshold, converted to:
- dataColumn >= 1 && dataColumn <= 100

dataColumn range from DataFile1: (20, 70)

we can create a filter such as:
- (20 <= 100 && 70 >= 1)
```

Now, to achieve this there is a prerequisite: Spark **should add dynamic filters on data columns** so that they get percolated through the DSv2 code path. There have been past attempts in Spark to achieve the same, ref: [pull/30146](https://github.com/apache/spark/pull/30146).

Pros:
* We will seamlessly support dynamic pruning on data columns when it is introduced in Spark
* It will be good to have for non-partitioned tables

Cons:
* The added code will be dead code until this functionality is introduced in Spark

References:

[1] Trino: [[CodePointer](https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java#L213-L220)]
[2] SparkDataFile APIs: [[CodePointer](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java#L145-L160)]
[3] Impala: https://issues.apache.org/jira/browse/IMPALA-3654

cc @rdblue @aokolnychyi @RussellSpitzer @jackye1995
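
For illustration, here is a minimal sketch of the bounds-based pruning described above, assuming we can reuse Iceberg's existing `InclusiveMetricsEvaluator`, which already evaluates an expression against a data file's per-column lower/upper bounds and null counts. The `runtimeFilter` expression and the way the `FileScanTask` list is obtained are placeholders, not the actual DSv2 integration point:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.expressions.InclusiveMetricsEvaluator;

public class RuntimeFilterSketch {

  // Keep only the scan tasks whose data file might contain matching rows.
  // InclusiveMetricsEvaluator returns false only when the file's column bounds
  // prove no row can match, e.g. when every literal of an IN filter falls
  // outside the file's (lower, upper) range.
  static List<FileScanTask> pruneTasks(Schema schema, Expression runtimeFilter,
                                       List<FileScanTask> tasks) {
    InclusiveMetricsEvaluator evaluator = new InclusiveMetricsEvaluator(schema, runtimeFilter);
    return tasks.stream()
        .filter(task -> evaluator.eval(task.file()))
        .collect(Collectors.toList());
  }

  // Example from above: dataColumn IN (1, 21, ..., 100) keeps a file with
  // bounds (20, 70) because 21 lies within the bounds.
  static Expression exampleFilter() {
    return Expressions.in("dataColumn", 1, 21, 100);
  }
}
```

Trino's `IcebergSplitSource` (reference [1] above) performs essentially this bounds check when pruning splits against dynamic filters.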
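
And a sketch of the IN-to-range rewrite along the lines of SPARK-32792; the threshold constant is an arbitrary placeholder, and the `Comparable` bound on the values is an assumption for the sake of the example:

```java
import java.util.Collections;
import java.util.List;

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class InToRangeSketch {

  // Hypothetical threshold; a real value would need benchmarking.
  static final int IN_SET_THRESHOLD = 20;

  // Rewrite column IN (v1, ..., vn) as column >= min && column <= max once the
  // literal set grows past the threshold. The rewrite is conservative: the range
  // covers every literal, so it may keep extra files but never drops a file
  // that a literal would have matched.
  static <T extends Comparable<T>> Expression rewrite(String column, List<T> values) {
    if (values.size() <= IN_SET_THRESHOLD) {
      return Expressions.in(column, values);
    }

    T min = Collections.min(values);
    T max = Collections.max(values);
    return Expressions.and(
        Expressions.greaterThanOrEqual(column, min),
        Expressions.lessThanOrEqual(column, max));
  }
}
```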
