Re: [I] Avoid consecutive RepartitionExec [datafusion]

via GitHub Fri, 31 Oct 2025 03:44:52 -0700


camuel commented on issue #18341:
URL: https://github.com/apache/datafusion/issues/18341#issuecomment-3472460488


   The slowdown is there even with a very simplified query without any 
predicates: just a projection and a count aggregate, and that's it. It is 3.5x 
slower than untuned, out-of-the-box DuckDB on Parquet files generated by 
DataFusion. From my profiling and experimentation, it looks like it only 
happens with dictionary-encoded strings, which both fields (l_returnflag, 
l_linestatus) seem to be. First of all, the attached profiling screenshot 
shows that 33% of the time is spent in hashbrown's hash table and another 31% 
in create_hashes in hash_utils. I reran the *simplified* TPC-H SF100 Q1 query 
on a few integer fields instead (dividing them to keep the cardinality 
equally low), and while it was still slower than DuckDB, the difference was 
much smaller. I have the whole setup fresh on my end and can answer questions 
to help troubleshoot it.
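
To make the suspected cost concrete, here is a minimal Python sketch of the two ways a count aggregate can group on dictionary-encoded columns. It is only an illustration, not DataFusion's implementation: the column layout (a list of dictionary values plus a list of per-row keys), the sample data, and both function names are hypothetical, and it assumes each dictionary holds unique values. It shows the structure of the work, not its timing.

```python
from collections import Counter

# Hypothetical dictionary-encoded columns: (dictionary values, per-row keys).
returnflag = (["A", "N", "R"], [0, 1, 1, 2, 0, 1])
linestatus = (["F", "O"], [0, 1, 1, 0, 0, 1])

def count_by_decoded_values(col_a, col_b):
    """Decode every row to its strings, then hash the string pair per row."""
    (dict_a, keys_a), (dict_b, keys_b) = col_a, col_b
    counts = Counter()
    for ka, kb in zip(keys_a, keys_b):
        counts[(dict_a[ka], dict_b[kb])] += 1
    return dict(counts)

def count_by_keys(col_a, col_b):
    """Group on the small integer keys; decode once per distinct group.

    Assumes each dictionary contains no duplicate values.
    """
    (dict_a, keys_a), (dict_b, keys_b) = col_a, col_b
    counts = Counter(zip(keys_a, keys_b))
    return {(dict_a[ka], dict_b[kb]): n for (ka, kb), n in counts.items()}

if __name__ == "__main__":
    print(count_by_decoded_values(returnflag, linestatus))
    print(count_by_keys(returnflag, linestatus))
```

Both functions return the same counts. The first touches the string values on every row, while the second only looks at the strings once per distinct group, which is the kind of difference that would matter for low-cardinality columns like these.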


