gene-bordegaray commented on code in PR #20331:
URL: https://github.com/apache/datafusion/pull/20331#discussion_r2823786180
##########
datafusion/physical-expr/src/expressions/dynamic_filters.rs:
##########
@@ -450,6 +531,25 @@ impl PhysicalExpr for DynamicFilterPhysicalExpr {
}
}
+/// Snapshot a `PhysicalExpr` tree, replacing any
[`DynamicFilterPhysicalExpr`] that
+/// has per-partition data with its partition-specific filter expression.
+/// If a `DynamicFilterPhysicalExpr` does not have partitioned data, it is
left unchanged.
+pub fn snapshot_physical_expr_for_partition(
+ expr: Arc<dyn PhysicalExpr>,
+ partition: usize,
+) -> Result<Arc<dyn PhysicalExpr>> {
Review Comment:
> This is an interesting way to go about it. It does kind of feel to me like
we are trying to shoehorn the concept of partitions into `PhysicalExpr` through
the `snapshot()` API which was never intended for this use case. I feel there
is probably a much cleaner way to introduce the concept of a partition to
`PhysicalExpr`
I agree that overloading the `snapshot()` is a bit odd. Here are the two
ways this can be improved upon:
**1. Transmit Partition Info in the Data:**
This is the suggestion @adriangb provided. I think it is and interesting
approach but has large implications. Injecting a `partition-id` or something
similar would mean changing projections, filter handling etc. With the
objective of this PR being to enable dynamic filtering for file preserved
partitioning, this seems like too large of a change.
**2. Runtime Partition Bind:**
Introduce a runtime wrapper `PartitionBoundDynamicFilterPhysicalExpr`. The
parquet opener binds the dynamic filters to partition `i` by inserting this new
wrapper rather than a static snapshot. Then, this wrapper is resolved at
evaluation time. This largely avoids the shoehorning partitioning in
`PhysicalExpr` through `snapshot()`.
With this said, I am aware this approach is not the cleanest solution for
this issue, but it does set us up for follow-up work. This can become a generic
runtime bind API that hooks on `PhysicalExpr`, by default we will have it be a
no-op. Then for `DynamicFilterPhysicalExpr` we will implement the hook and
eliminate our wrapper. This will provide support for any future
runtime-dependent expressions without the ad-hoc plumbing we did before and no
schema / data changes.
I decided to leave the full generic solution our of this PR to keep these
changes as local as I can, but tried to lay a path forward that is
maintainable. Let me know thoughts on this approach.
cc: @adriangb @LiaCastaneda
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]