2010YOUY01 commented on PR #18488: URL: https://github.com/apache/datafusion/pull/18488#issuecomment-3490039882
Thank you, I think the implementation is correct.

The only consideration is performance: the hash join implementation is on the performance-critical path, so we have to be careful not to introduce additional overhead. This PR should be good to go if we can verify that it does not affect performance.

In this PR, the extra overhead is, for each batch, computing `distinct_count` over a sorted vector like [0,1,1,2,2,2...] that is at most batch size long, so it seems unlikely to be the bottleneck. (@alamb Could you help trigger the benchmark please?)

I believe these metrics provide more insight than simply computing `output_rows / input_rows` for equal joins. However, if they introduce noticeable overhead, we can move them under `ExplainAnalyzeLevel::Dev` and track them only when this extra-verbose level is enabled. We should also document that more detailed analyze levels may incur additional execution overhead but offer deeper insights.
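For reference, the per-batch cost discussed in this comment is roughly a single linear pass over the sorted indices. The sketch below is only an illustration of that idea, not the code from the PR: the function name `count_distinct_sorted` and the `u64` index type are assumptions made for the example.

```rust
/// Counts distinct values in a slice whose equal values are adjacent
/// (for example, sorted indices such as [0, 1, 1, 2, 2, 2]).
///
/// Hypothetical helper for illustration; the name and the `u64` element
/// type are assumptions and are not taken from the PR.
fn count_distinct_sorted(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    // The first element always starts a run; every position where two
    // neighbouring values differ starts one more run.
    1 + indices.windows(2).filter(|pair| pair[0] != pair[1]).count()
}
```

Called on `[0, 1, 1, 2, 2, 2]`, this makes five neighbour comparisons and allocates nothing, which is why the cost is bounded by the batch size. Whether that is negligible next to the rest of the join still needs to be confirmed by the benchmark.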
