aviralgarg05 commented on issue #19858:
URL: https://github.com/apache/datafusion/issues/19858#issuecomment-3763775497
My bet is that we’re seeing the **cost of evaluation on the probe side
(2)**, specifically in cases where the dynamic filter prunes few rows. If
the filter (e.g. min/max bounds or a Bloom filter) doesn’t eliminate a
significant fraction of the probe-side rows, we end up paying a per-row
evaluation “tax” without gaining any benefit from smaller join inputs.
To nail this down, I’d propose:
1. **Isolate the Bottleneck:** Can we run a test where we *compute* the
dynamic filters on the build side but **do not apply them on the probe side**?
- If it’s still slow, the issue is the **creation overhead** (or
synchronization blocking the probe side).
- If it speeds up, the issue is most likely the **evaluation cost**.
2. **Verify Selectivity:** I suspect that for the slow TPC-H queries, the filter
is pruning very few rows. Comparing the probe-side row counts before and after
the filter would confirm this.
3. **Potential Solution:** If it *is* the evaluation cost, we might need an
**adaptive approach**. We could track the filter’s hit rate at runtime; if it’s
ineffective (e.g., pruning < 10% of rows) after the first few batches, we
should dynamically disable it to stop the bleeding.
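
To make the adaptive idea in (3) concrete, here is a rough sketch of the bookkeeping it would need. Everything in it is hypothetical: `AdaptiveFilterStats`, `warmup_batches`, and `min_prune_ratio` are names made up for illustration and are not part of DataFusion’s API. It assumes the operator can report rows in and rows out for each probe-side batch it filters.

```rust
/// Hypothetical per-filter statistics; not an existing DataFusion type.
#[derive(Debug)]
pub struct AdaptiveFilterStats {
    rows_seen: u64,
    rows_pruned: u64,
    batches_seen: u32,
    /// Number of batches to observe before making any decision (assumed knob).
    warmup_batches: u32,
    /// Minimum fraction of rows the filter must prune to stay on (assumed knob).
    min_prune_ratio: f64,
    enabled: bool,
}

impl AdaptiveFilterStats {
    pub fn new(warmup_batches: u32, min_prune_ratio: f64) -> Self {
        Self {
            rows_seen: 0,
            rows_pruned: 0,
            batches_seen: 0,
            warmup_batches,
            min_prune_ratio,
            enabled: true,
        }
    }

    /// Whether the probe side should still evaluate the filter.
    pub fn is_enabled(&self) -> bool {
        self.enabled
    }

    /// Cumulative fraction of rows pruned so far (0.0 if nothing was seen).
    pub fn prune_ratio(&self) -> f64 {
        if self.rows_seen == 0 {
            0.0
        } else {
            self.rows_pruned as f64 / self.rows_seen as f64
        }
    }

    /// Record one filtered batch and disable the filter if, once the warm-up
    /// is over, the cumulative prune ratio is below the threshold.
    /// Disabling is one-way in this sketch; counters freeze afterwards.
    pub fn record_batch(&mut self, rows_in: u64, rows_out: u64) {
        if !self.enabled {
            return;
        }
        self.rows_seen += rows_in;
        self.rows_pruned += rows_in.saturating_sub(rows_out);
        self.batches_seen += 1;

        if self.batches_seen >= self.warmup_batches
            && self.prune_ratio() < self.min_prune_ratio
        {
            self.enabled = false;
        }
    }
}
```

With `AdaptiveFilterStats::new(3, 0.10)`, a filter that removes only 2% of rows over the first three batches would be switched off, while one that removes 50% would stay on. A real implementation would probably also want a way to re-enable the filter, since the build side can tighten a dynamic filter over time; this sketch leaves that out.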
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]