camuel commented on issue #18341: URL: https://github.com/apache/datafusion/issues/18341#issuecomment-3472460488
The slowdown is there even with a very simplified query without any predicates: just a projection and a count aggregate, and that's it. It is 3.5x slower than untuned DuckDB out of the box, with Parquet files generated by DataFusion. From my profiling and experimentation, it looks like it only happens with dictionary-encoded strings, which both fields (l_returnflag, l_linestatus) seem to be.

First of all, the attached profiling screenshot shows that 33% of the time is spent in hashbrown's hash table and another 31% in create_hashes in hash_utils. I reran the *simplified* TPC-H SF100 Q1 query on a few integer fields instead (dividing them to keep the cardinality similarly low), and while it was still slower than DuckDB, the difference was not nearly as large.

I have the whole setup fresh on my end and can answer questions to help troubleshoot it.
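To illustrate why per-row hashing matters for a dictionary-encoded group key, here is a minimal sketch using only the Rust standard library. It is not DataFusion's or DuckDB's code, and `DictColumn`, `count_by_row`, and `count_by_key` are hypothetical names made up for this example. It only contrasts hashing the string for every row with counting per dictionary key and touching each distinct string once.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded string column (hypothetical type, for illustration only):
/// `keys[i]` is an index into `values`.
struct DictColumn {
    values: Vec<String>,
    keys: Vec<u32>,
}

/// Row-at-a-time grouping: looks up and hashes the string for every row.
fn count_by_row(col: &DictColumn) -> HashMap<&str, u64> {
    let mut counts = HashMap::new();
    for &k in &col.keys {
        *counts.entry(col.values[k as usize].as_str()).or_insert(0) += 1;
    }
    counts
}

/// Key-based grouping: counts per dictionary key with plain array indexing,
/// then hashes each distinct string only once.
fn count_by_key(col: &DictColumn) -> HashMap<&str, u64> {
    let mut per_key = vec![0u64; col.values.len()];
    for &k in &col.keys {
        per_key[k as usize] += 1;
    }
    let mut counts = HashMap::new();
    for (i, &n) in per_key.iter().enumerate() {
        if n > 0 {
            // Merging by string also handles dictionaries with duplicate values.
            *counts.entry(col.values[i].as_str()).or_insert(0) += n;
        }
    }
    counts
}
```

Both functions return the same counts. The second performs one hash table operation per distinct value rather than one per row, which is the kind of saving a low-cardinality key such as l_returnflag could in principle allow. Whether that explains the gap reported here is an assumption, not something the profile confirms.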
