aviralgarg05 commented on issue #19858: URL: https://github.com/apache/datafusion/issues/19858#issuecomment-3768758076
I completely agree. Let's keep the guesswork out of it and stick to the data; a scientific breakdown is exactly what's needed here.

To prove or disprove the "evaluation tax" hypothesis, we can collect metrics on the specific TPC-H queries that are regressing. I'd suggest we track:

1. **Selectivity impact**: Add metrics to track `rows_pruned` versus `total_rows_scanned` by the dynamic filter. If we're seeing <1% pruning in the slow queries, we have our answer on why the overhead isn't worth it.
2. **Timing breakdown**: Instrument the probe-side filter to measure `eval_time`. Comparing this against the total `join_time` will show whether the expression engine is the primary bottleneck.
3. **Threshold testing**: Run a matrix of SF1 vs SF10. My suspicion is that at SF1 the constant overhead of creating and evaluating the filter dominates, whereas at SF10+ the I/O savings finally make it "profitable."

I can help put together a quick manual test plan or a PR to instrument these specific metrics if that helps move the needle.
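
To make points 1 and 2 concrete, here's a rough sketch of how the counters and timer could be wired up through DataFusion's existing `ExecutionPlanMetricsSet` / `MetricBuilder` API. This is not the actual dynamic-filter implementation; the `InstrumentedDynamicFilter` wrapper, its `filter_batch` method, and the metric names are illustrative placeholders.

```rust
use std::sync::Arc;

use datafusion::arrow::array::BooleanArray;
use datafusion::arrow::compute::filter_record_batch;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::common::Result;
use datafusion::physical_expr::PhysicalExpr;
use datafusion::physical_plan::metrics::{Count, ExecutionPlanMetricsSet, MetricBuilder, Time};

/// Hypothetical wrapper that evaluates a pushed-down dynamic filter on the
/// probe side while recording selectivity and evaluation time.
struct InstrumentedDynamicFilter {
    predicate: Arc<dyn PhysicalExpr>,
    total_rows_scanned: Count,
    rows_pruned: Count,
    eval_time: Time,
}

impl InstrumentedDynamicFilter {
    fn new(
        predicate: Arc<dyn PhysicalExpr>,
        metrics: &ExecutionPlanMetricsSet,
        partition: usize,
    ) -> Self {
        Self {
            predicate,
            total_rows_scanned: MetricBuilder::new(metrics)
                .counter("total_rows_scanned", partition),
            rows_pruned: MetricBuilder::new(metrics).counter("rows_pruned", partition),
            eval_time: MetricBuilder::new(metrics)
                .subset_time("dynamic_filter_eval_time", partition),
        }
    }

    /// Apply the dynamic filter to one batch, tracking how many rows it prunes.
    fn filter_batch(&self, batch: &RecordBatch) -> Result<RecordBatch> {
        // Time only the expression evaluation + filtering so it can be
        // compared against the operator's total elapsed_compute / join_time.
        let _timer = self.eval_time.timer();

        self.total_rows_scanned.add(batch.num_rows());

        let mask_array = self.predicate.evaluate(batch)?.into_array(batch.num_rows())?;
        let mask = mask_array
            .as_any()
            .downcast_ref::<BooleanArray>()
            .expect("dynamic filter predicate must evaluate to a boolean array");

        let filtered = filter_record_batch(batch, mask)?;
        self.rows_pruned.add(batch.num_rows() - filtered.num_rows());
        Ok(filtered)
    }
}
```

Assuming the `ExecutionPlanMetricsSet` passed in is the one exposed by the operator's `metrics()` method, these counters would show up in `EXPLAIN ANALYZE` output next to the existing metrics, so the selectivity ratio and the eval-time share could be read straight off the regressing TPC-H queries.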
