Re: [PR] feat: selectivity metrics (for Explain Analyze) in Hash Join [datafusion]

via GitHub Thu, 06 Nov 2025 19:12:27 -0800


2010YOUY01 commented on PR #18488:
URL: https://github.com/apache/datafusion/pull/18488#issuecomment-3500400479


   > > It's running in a noisy cloud environment, and tpch_mem takes quite 
short time, so it might not be accurate.
   > 
   > Very interesting
   > 
   > > I’ve verified this with tpch_mem10 locally, and it actually slows down 
several queries.
   > 
   > Thanks for confirming!
   > 
   > > I tried to make this count distinct indices faster (and sent a PR 
[feniljain#2](https://github.com/feniljain/datafusion/pull/2)),
   > 
   > Very curious about this PR, cause it seems you have written the same logic 
as mine, but using loops instead, am I missing some detail?
   > 
   > Is it that the `None` check, which is causing all this overhead?
   
   If we can make the loop body really simple, the compiler can often generate 
more efficient machine code (for example, by using SIMD instructions), and the 
hardware can execute it faster through several mechanisms (e.g. better memory 
prefetching). Together, this can make an equivalent implementation several 
times faster.
   
   I'm not entirely sure under which circumstances the compiler might fail to 
optimize, so I try to keep the loop body as simple as possible — and that 
usually works well.
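
   As a rough illustration of the idea (a minimal sketch, not the code from 
either PR), the two hypothetical functions below count distinct values in a 
non-decreasing slice of indices. The function names, the `u64` element type, 
and the sortedness assumption are all made up for this example. The first 
version tracks the previous value in an `Option`; the second keeps the loop 
body to a single comparison and an add.

```rust
/// Hypothetical helper: counts distinct values in a non-decreasing slice,
/// tracking the previously seen value in an `Option`.
fn count_distinct_with_option(indices: &[u64]) -> usize {
    let mut prev: Option<u64> = None;
    let mut count = 0;
    for &idx in indices {
        // The `Option` comparison is extra state carried through every iteration.
        if prev != Some(idx) {
            count += 1;
        }
        prev = Some(idx);
    }
    count
}

/// Hypothetical helper: same result, but the loop body is one comparison
/// whose boolean result is added as 0 or 1.
fn count_distinct_simple(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    // The first element always starts a new run of equal values.
    let mut count = 1;
    // Pair each element with its successor; the empty case is handled above.
    for (a, b) in indices.iter().zip(&indices[1..]) {
        count += (a != b) as usize;
    }
    count
}
```

   Both functions return 4 for `[0, 0, 1, 3, 3, 3, 7]`. Whether the compiler 
actually vectorizes either version depends on the target and optimization 
level, so it is worth benchmarking rather than assuming a speedup.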


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


