aviralgarg05 commented on issue #19858:
URL: https://github.com/apache/datafusion/issues/19858#issuecomment-3768758076

   I completely agree—let's keep the guesswork out of it and stick to the data. 
A scientific breakdown is exactly what's needed here.
   To prove/disprove the "evaluation tax" hypothesis, we can collect metrics on 
the specific TPC-H queries that are regressing. I'd suggest we track:
   1.  **Selectivity Impact**: Add metrics to track `rows_pruned` versus `total_rows_scanned` by the dynamic filter. If we're seeing <1% pruning in the slow queries, we have our answer: the evaluation overhead isn't worth it.
   2.  **Timing Breakdown**: Instrument the probe-side filter to measure `eval_time`. Comparing it against total `join_time` will show whether the expression engine is the primary bottleneck.
   3.  **Threshold Testing**: Run a matrix of SF1 vs SF10. My suspicion is that 
at SF1, the constant overhead of creating and evaluating the filter dominates, 
whereas at SF10+, the I/O savings finally make it "profitable."
   I can help put together a quick manual test plan or a PR to instrument these 
specific metrics if that helps move the needle.
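
   For illustration, the first two metrics could be wired up roughly like this. This is only a sketch: the `DynamicFilterMetrics` struct, its field names, and the profitability threshold are hypothetical, not DataFusion's actual metrics API.

   ```rust
   use std::time::Duration;

   /// Hypothetical counters a probe-side dynamic filter could expose.
   struct DynamicFilterMetrics {
       rows_pruned: u64,
       total_rows_scanned: u64,
       eval_time: Duration,
       join_time: Duration,
   }

   impl DynamicFilterMetrics {
       /// Fraction of scanned rows the filter pruned (0.0 if nothing scanned).
       fn pruning_ratio(&self) -> f64 {
           if self.total_rows_scanned == 0 {
               0.0
           } else {
               self.rows_pruned as f64 / self.total_rows_scanned as f64
           }
       }

       /// Fraction of total join time spent evaluating the filter expression.
       fn eval_fraction(&self) -> f64 {
           let join = self.join_time.as_secs_f64();
           if join == 0.0 {
               0.0
           } else {
               self.eval_time.as_secs_f64() / join
           }
       }

       /// Crude heuristic: the filter "pays for itself" only if it prunes
       /// at least `min_pruning` of the scanned rows.
       fn is_profitable(&self, min_pruning: f64) -> bool {
           self.pruning_ratio() >= min_pruning
       }
   }

   fn main() {
       // Example resembling the suspected "<1% pruning" regression case.
       let m = DynamicFilterMetrics {
           rows_pruned: 5_000,
           total_rows_scanned: 1_000_000,
           eval_time: Duration::from_millis(40),
           join_time: Duration::from_millis(200),
       };
       println!(
           "pruning={:.2}% eval={:.0}% of join time, profitable={}",
           m.pruning_ratio() * 100.0,
           m.eval_fraction() * 100.0,
           m.is_profitable(0.01)
       );
   }
   ```

   With numbers like these (0.50% pruning but 20% of join time spent on evaluation), the heuristic would flag the filter as unprofitable, which is exactly the signal we'd want the instrumentation to surface.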


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
