Re: [PR] feat: selectivity metrics (for Explain Analyze) in Hash Join [datafusion]

via GitHub Wed, 05 Nov 2025 01:05:03 -0800


2010YOUY01 commented on PR #18488:
URL: https://github.com/apache/datafusion/pull/18488#issuecomment-3490039882


   Thank you, I think the implementation is correct.
   
   The only consideration is performance. The Hash Join implementation is 
definitely on the performance-critical path, so we have to be careful not to 
introduce additional overhead. This PR should be good to go if we can verify 
that it does not affect performance.
   
   In this PR, the extra overhead is, for each batch, computing the 
`distinct_count` of a sorted vector like [0,1,1,2,2,2...] that is up to batch 
size long, which seems unlikely to be the bottleneck. (@alamb Could you help 
trigger the benchmark please?)
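
For illustration, the per-batch work described above could look roughly like the sketch below. This is not the code from the PR; the function name `distinct_count_sorted` and the `u64` element type are assumptions made for the example. It only shows that counting distinct values in an already-sorted slice is a single linear pass with no allocation.

```rust
/// Counts the distinct values in a slice that is already sorted
/// (non-decreasing), e.g. [0, 1, 1, 2, 2, 2].
///
/// Hypothetical helper for illustration; not taken from the PR.
fn distinct_count_sorted(sorted: &[u64]) -> usize {
    if sorted.is_empty() {
        return 0;
    }
    // The first element always starts a run; every adjacent pair whose
    // values differ marks the start of another run.
    1 + sorted.windows(2).filter(|pair| pair[0] != pair[1]).count()
}
```

Calling `distinct_count_sorted(&[0, 1, 1, 2, 2, 2])` yields 3. The cost is one comparison per element, so for a batch-sized input it should be small relative to the hashing and probing work, though only a benchmark can confirm that.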
   
   I believe these metrics provide more insight than simply computing 
`output_rows / input_rows` for equi-joins. However, if they introduce 
noticeable overhead, we can move them under `ExplainAnalyzeLevel::Dev`, and 
track them only when this extra-verbose level is enabled. We should also 
document that more detailed analyze levels may incur additional execution 
overhead but offer deeper insights.
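
A minimal sketch of that fallback, assuming a simplified stand-in for the analyze level: the `AnalyzeLevel` enum, the `JoinMetrics` struct, and all field and method names below are hypothetical and do not reflect DataFusion's actual metrics API. The idea is only that the cheap counters are always updated, while the distinct count is computed when the verbose level is enabled.

```rust
/// Hypothetical verbosity levels, loosely modeled on the
/// `ExplainAnalyzeLevel` mentioned above (not the real type).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AnalyzeLevel {
    Summary,
    Dev,
}

/// Hypothetical per-operator metrics holder.
struct JoinMetrics {
    level: AnalyzeLevel,
    input_rows: usize,
    output_rows: usize,
    /// Stays `None` unless the `Dev` level is enabled.
    distinct_matched: Option<usize>,
}

impl JoinMetrics {
    fn new(level: AnalyzeLevel) -> Self {
        Self { level, input_rows: 0, output_rows: 0, distinct_matched: None }
    }

    /// Records one batch. `sorted_indices` holds the matched row indices
    /// in non-decreasing order, e.g. [0, 1, 1, 2, 2, 2].
    fn record_batch(&mut self, sorted_indices: &[u64], input_rows: usize) {
        // Cheap counters are always tracked.
        self.input_rows += input_rows;
        self.output_rows += sorted_indices.len();

        // The extra linear pass runs only at the verbose level.
        if self.level >= AnalyzeLevel::Dev {
            let distinct = if sorted_indices.is_empty() {
                0
            } else {
                1 + sorted_indices.windows(2).filter(|p| p[0] != p[1]).count()
            };
            *self.distinct_matched.get_or_insert(0) += distinct;
        }
    }
}
```

With this shape, a `Summary` run pays only for two integer additions per batch, and the documentation note about deeper levels costing more would apply to the `Dev` branch.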


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

