adriangb commented on code in PR #18938:
URL: https://github.com/apache/datafusion/pull/18938#discussion_r2578967858
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -1159,34 +1164,38 @@ impl ExecutionPlan for HashJoinExec {
let right_child_self_filters = &child_pushdown_result.self_filters[1];
// We only push down filters to the right child
// We expect 0 or 1 self filters
if let Some(filter) = right_child_self_filters.first() {
- // Note that we don't check PushdDownPredicate::discrimnant because even if nothing said
- // "yes, I can fully evaluate this filter" things might still use it for statistics -> it's worth updating
- let predicate = Arc::clone(&filter.predicate);
- if let Ok(dynamic_filter) =
- Arc::downcast::<DynamicFilterPhysicalExpr>(predicate)
- {
- // We successfully pushed down our self filter - we need to make a new node with the dynamic filter
- let new_node = Arc::new(HashJoinExec {
- left: Arc::clone(&self.left),
- right: Arc::clone(&self.right),
- on: self.on.clone(),
- filter: self.filter.clone(),
- join_type: self.join_type,
- join_schema: Arc::clone(&self.join_schema),
- left_fut: Arc::clone(&self.left_fut),
- random_state: self.random_state.clone(),
- mode: self.mode,
- metrics: ExecutionPlanMetricsSet::new(),
- projection: self.projection.clone(),
- column_indices: self.column_indices.clone(),
- null_equality: self.null_equality,
- cache: self.cache.clone(),
- dynamic_filter: Some(HashJoinExecDynamicFilter {
- filter: dynamic_filter,
- build_accumulator: OnceLock::new(),
- }),
- });
- result = result.with_updated_node(new_node as Arc<dyn ExecutionPlan>);
+ // Only create the dynamic filter if the probe side will actually use it (Exact or Inexact).
+ // If it's Unsupported, don't compute the filter since it won't be used.
+ let will_be_used = !matches!(filter.discriminant, PushedDown::Unsupported);
Review Comment:
I think these are both good reasons to make this API change. Maybe we can justify the change by doing what you are saying and skipping the pushdown of the entire hash table if it won't be used? But then again, that is basically free... and where do bloom filters fall into this calculation? I.e., bloom filter pruning only works if HashJoinExec produces an InListExpr *and* the scan node is a Parquet node (or another format that supports bloom filters). That seems like an awful lot of complex coordination between the producer and consumer, specific both to each file (some may have bloom filters, some don't) and to the filter being pushed down (min/max vs. InList vs. Hash Table).
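
For reference, a minimal standalone sketch of the gating the diff introduces and that this comment is debating: only skip building the dynamic filter when the consumer reports `Unsupported`, since both `Exact` and `Inexact` consumers might still benefit (statistics, possible bloom-filter pruning, etc.). The enum mirrors the `PushedDown` discriminant used in the diff; the `SelfFilter` struct and the function are illustrative stand-ins, not DataFusion APIs.

```rust
/// Roughly mirrors the PushedDown discriminant from the diff:
/// how the probe side said it can handle the pushed-down filter.
#[derive(Debug, Clone, Copy)]
enum PushedDown {
    /// Consumer will evaluate the filter exactly.
    Exact,
    /// Consumer may apply it best-effort (e.g. for statistics pruning).
    Inexact,
    /// Consumer will ignore it entirely.
    Unsupported,
}

/// Hypothetical stand-in for the pushed-down self filter.
struct SelfFilter {
    discriminant: PushedDown,
}

/// Decide whether the build side should bother materializing a dynamic
/// filter (min/max, InList, or a full hash table) for this probe side.
fn should_build_dynamic_filter(filter: &SelfFilter) -> bool {
    // Same shape as the diff: skip only when the consumer reports
    // Unsupported; Exact and Inexact consumers may still use the filter.
    !matches!(filter.discriminant, PushedDown::Unsupported)
}

fn main() {
    for d in [PushedDown::Exact, PushedDown::Inexact, PushedDown::Unsupported] {
        let decision = should_build_dynamic_filter(&SelfFilter { discriminant: d });
        println!("{d:?}: build dynamic filter = {decision}");
    }
}
```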
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]