aokolnychyi commented on pull request #35395:
URL: https://github.com/apache/spark/pull/35395#issuecomment-1072693531
Alright, there are multiple ways to support a separate scan builder for
runtime filtering.
One way could be to expose a mix-in interface for `RowLevelOperation` with
the following method (naming TBD):
```
ScanBuilder newRuntimeFilterScanBuilder(CaseInsensitiveStringMap options);
```
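To make the shape of such a mix-in concrete, here is a minimal, self-contained sketch. Every type in it is a simplified stand-in rather than the real Spark connector API: `SupportsRuntimeFilterScanBuilder` is a placeholder name (naming is still TBD), `ExampleOperation` is hypothetical, and a plain `Map<String, String>` stands in for `CaseInsensitiveStringMap`.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

public class RuntimeFilterMixinSketch {

    // Simplified stand-ins for the connector interfaces; NOT the real Spark API.
    public interface Scan {
        String description();
    }

    public interface ScanBuilder {
        Scan build();
    }

    public interface RowLevelOperation {
        ScanBuilder newScanBuilder(Map<String, String> options);
    }

    // Hypothetical mix-in; the interface name is a placeholder (naming TBD).
    public interface SupportsRuntimeFilterScanBuilder extends RowLevelOperation {
        ScanBuilder newRuntimeFilterScanBuilder(Map<String, String> options);
    }

    // Toy operation that offers a separate, narrower scan for runtime filtering.
    public static class ExampleOperation implements SupportsRuntimeFilterScanBuilder {
        @Override
        public ScanBuilder newScanBuilder(Map<String, String> options) {
            return () -> () -> "main scan: id, dep, _file_name";
        }

        @Override
        public ScanBuilder newRuntimeFilterScanBuilder(Map<String, String> options) {
            return () -> () -> "runtime filter scan: id, _file_name";
        }
    }

    // What an optimizer rule could do: use the dedicated builder only when the
    // operation opts in through the mix-in.
    public static Optional<ScanBuilder> runtimeFilterScanBuilder(
            RowLevelOperation operation, Map<String, String> options) {
        if (operation instanceof SupportsRuntimeFilterScanBuilder) {
            SupportsRuntimeFilterScanBuilder mixin =
                (SupportsRuntimeFilterScanBuilder) operation;
            return Optional.of(mixin.newRuntimeFilterScanBuilder(options));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> options = Collections.emptyMap();
        RowLevelOperation operation = new ExampleOperation();
        runtimeFilterScanBuilder(operation, options)
            .map(builder -> builder.build().description())
            .ifPresent(System.out::println);
    }
}
```

Operations that do not implement the mix-in yield an empty `Optional`, so their behavior stays unchanged.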
Then we can have an optimizer rule that will be applied after the main
`Scan` for the row-level operation is built. That rule will catch row-level
operations that implement the mix-in interface and whose main `Scan`
implements `SupportsRuntimeFiltering`. The main `Scan` would report its filter
attributes via `filterAttributes`. During runtime filtering, we will collect
unique values for these attributes and pass them back to the main `Scan` via
`filter(Filter[] filters)`. The rule can construct a filter subquery that
references `DataSourceV2Relation`. We will have to call `OptimizeSubquery` on
it to rewrite predicate subqueries as joins anyway, so we can leverage
`V2ScanRelationPushDown` to do the planning, which would use the runtime filter
scan builder.
This will enable schema pruning and filter pushdown within groups during
runtime filtering.
```
== Optimized Logical Plan ==
ReplaceData RelationV2[id#135, dep#136] testhadoop.default.table
+- Project [id#135, dep#136]
   +- Filter NOT (exists#157 <=> true)
      +- Join ExistenceJoin(exists#157), (id#135 = value#46)
         :- Project [id#135, dep#136]
         :  +- Filter dynamicpruning#154 [_file_name#139]
         :     :  +- Project [_file_name#152]
         :     :     +- Join LeftSemi, (id#150 = value#141)
         :     :        :- RelationV2[id#150, _file_name#152] testhadoop.default.table
         :     :        +- LocalRelation [value#141]
         :     +- RelationV2[id#135, dep#136, _file_name#139] testhadoop.default.table
         +- LocalRelation [value#46]
```
In the example above, the scan relation in the runtime filter projects only
`id` and `_file_name`.
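The runtime filtering flow described earlier (the main `Scan` reports its filter attributes, unique values are collected for them, and the resulting filters are passed back to the `Scan`) can be sketched roughly as follows. This is only an illustration under stated assumptions: `In`, `RuntimeFilterableScan`, `FileScan`, and `buildRuntimeFilters` are hypothetical stand-ins, not Spark classes, and the rows are hard-coded instead of coming from a real filter subquery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RuntimeFilteringSketch {

    // Stand-in for a source filter of the form `attribute IN (values)`.
    public static final class In {
        public final String attribute;
        public final Set<Object> values;

        public In(String attribute, Set<Object> values) {
            this.attribute = attribute;
            this.values = values;
        }
    }

    // Stand-in for a scan that supports runtime filtering; NOT the real Spark API.
    public interface RuntimeFilterableScan {
        String[] filterAttributes();
        void filter(In[] filters);
    }

    // Toy scan over files; runtime filters on `_file_name` prune its file list.
    public static final class FileScan implements RuntimeFilterableScan {
        private final List<String> files;

        public FileScan(List<String> files) {
            this.files = new ArrayList<>(files);
        }

        @Override
        public String[] filterAttributes() {
            return new String[] {"_file_name"};
        }

        @Override
        public void filter(In[] filters) {
            for (In filter : filters) {
                if (filter.attribute.equals("_file_name")) {
                    files.removeIf(file -> !filter.values.contains(file));
                }
            }
        }

        public List<String> plannedFiles() {
            return Collections.unmodifiableList(files);
        }
    }

    // Collects the unique values of each filter attribute from the rows produced
    // by the filter subquery and wraps them in IN filters.
    public static In[] buildRuntimeFilters(
            String[] attributes, List<Map<String, Object>> rows) {
        In[] filters = new In[attributes.length];
        for (int i = 0; i < attributes.length; i++) {
            Set<Object> values = new LinkedHashSet<>();
            for (Map<String, Object> row : rows) {
                values.add(row.get(attributes[i]));
            }
            filters[i] = new In(attributes[i], values);
        }
        return filters;
    }

    public static void main(String[] args) {
        FileScan scan = new FileScan(Arrays.asList("f1", "f2", "f3"));
        // Rows the filter subquery might return: only f1 and f3 contain matches.
        List<Map<String, Object>> rows = Arrays.asList(
            Collections.<String, Object>singletonMap("_file_name", "f1"),
            Collections.<String, Object>singletonMap("_file_name", "f3"),
            Collections.<String, Object>singletonMap("_file_name", "f1"));
        scan.filter(buildRuntimeFilters(scan.filterAttributes(), rows));
        System.out.println(scan.plannedFiles()); // prints [f1, f3]
    }
}
```

Because the subquery side only needs the join key and the filter attribute, a dedicated scan builder for it can prune every other column.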