gene-bordegaray commented on code in PR #20331:
URL: https://github.com/apache/datafusion/pull/20331#discussion_r2823786180


##########
datafusion/physical-expr/src/expressions/dynamic_filters.rs:
##########
@@ -450,6 +531,25 @@ impl PhysicalExpr for DynamicFilterPhysicalExpr {
     }
 }
 
+/// Snapshot a `PhysicalExpr` tree, replacing any 
[`DynamicFilterPhysicalExpr`] that
+/// has per-partition data with its partition-specific filter expression.
+/// If a `DynamicFilterPhysicalExpr` does not have partitioned data, it is 
left unchanged.
+pub fn snapshot_physical_expr_for_partition(
+    expr: Arc<dyn PhysicalExpr>,
+    partition: usize,
+) -> Result<Arc<dyn PhysicalExpr>> {

Review Comment:
   > This is an interesting way to go about it. It does kind of feel to me like 
we are trying to shoehorn the concept of partitions into `PhysicalExpr` through 
the `snapshot()` API which was never intended for this use case. I feel there 
is probably a much cleaner way to introduce the concept of a partition to 
`PhysicalExpr`
   I agree that overloading the `snapshot()` is a bit odd. Here are the two 
ways this can be improved upon:
   
   **1. Transmit Partition Info in the Data:**
   This is the suggestion @adriangb provided. I think it is and interesting 
approach but has large implications. Injecting a `partition-id` or something 
similar would mean changing projections, filter handling etc. With the 
objective of this PR being to enable dynamic filtering for file preserved 
partitioning, this seems like too large of a change.
   
   **2. Runtime Partition Bind:**
   Introduce a runtime wrapper `PartitionBoundDynamicFilterPhysicalExpr`. The 
parquet opener binds the dynamic filters to partition `i` by inserting this new 
wrapper rather than a static snapshot. Then, this wrapper is resolved at 
evaluation time. This largely avoids the shoehorning partitioning in 
`PhysicalExpr` through `snapshot()`.
   
   With this said, I am aware this approach is not the cleanest solution for 
this issue, but it does set us up for follow-up work. This can become a generic 
runtime bind API that hooks on `PhysicalExpr`, by default we will have it be a 
no-op. Then for `DynamicFilterPhysicalExpr` we will implement the hook and 
eliminate our wrapper. This will provide support for any future 
runtime-dependent expressions without the ad-hoc plumbing we did before and no 
schema / data changes.
   
   I decided to leave the full generic solution our of this PR to keep these 
changes as local as I can, but tried to lay a path forward that is 
maintainable. Let me know thoughts on this approach.
   
   cc: @adriangb @LiaCastaneda 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to