Re: [I] TPCH q1 with no predicates is 2x slower than duckdb [datafusion]

via GitHub Wed, 17 Dec 2025 04:47:55 -0800


alamb commented on issue #18411:
URL: https://github.com/apache/datafusion/issues/18411#issuecomment-3665195440


   That being said, I had one idea for optimizing the "all short strings" 
case (that is, all strings that fit inline in views of 12 bytes or less): the 
group values implementation could special-case short strings.
   1. If all strings in the input array are short (no data buffers), store 
them in a `HashSet<u128>` (that is, store the view values directly)
   
   If a new batch arrives that has longer strings, we would have to fall 
back to the current implementation, which stores data buffers. 
   
   
   It would certainly help this query and I could definitely see it being 
helpful for real queries on short string columns 🤔 
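
A minimal sketch of the idea above, using only the Rust standard library rather than DataFusion's actual group values code. Everything named here (`DistinctStrings`, `pack_short`, `unpack_short`, `insert_batch`) is hypothetical and for illustration only. It assumes the 16-byte string view layout from the Arrow format (a 4-byte length followed by up to 12 inline bytes) and uses a plain `HashSet<String>` as a stand-in for the current implementation that stores data buffers.

```rust
use std::collections::HashSet;

/// Longest string (in bytes) that fits inline in a 16-byte view.
const MAX_INLINE: usize = 12;

/// Hypothetical helper: pack a short string into a u128 laid out like an
/// inline string view (4-byte little-endian length, then the bytes, zero padded).
fn pack_short(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE {
        return None;
    }
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// Hypothetical helper: recover the string from a packed value.
fn unpack_short(v: u128) -> String {
    let raw = v.to_le_bytes();
    let len = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]) as usize;
    String::from_utf8(raw[4..4 + len].to_vec()).expect("packed from valid UTF-8")
}

/// Distinct group keys, with a fast path while every value seen is short.
enum DistinctStrings {
    /// Fast path: values stored directly as packed u128s, no data buffers.
    Short(HashSet<u128>),
    /// Stand-in for the existing general-purpose implementation.
    General(HashSet<String>),
}

impl DistinctStrings {
    fn new() -> Self {
        DistinctStrings::Short(HashSet::new())
    }

    /// Add one batch of values, falling back permanently to the general
    /// representation the first time a batch contains a long string.
    fn insert_batch(&mut self, batch: &[&str]) {
        let all_short = batch.iter().all(|s| s.len() <= MAX_INLINE);
        if let DistinctStrings::Short(set) = self {
            if all_short {
                set.extend(batch.iter().filter_map(|s| pack_short(s)));
                return;
            }
            // Fallback: migrate the values collected so far.
            let migrated: HashSet<String> =
                set.iter().map(|&v| unpack_short(v)).collect();
            *self = DistinctStrings::General(migrated);
        }
        if let DistinctStrings::General(set) = self {
            set.extend(batch.iter().map(|s| (*s).to_owned()));
        }
    }

    fn len(&self) -> usize {
        match self {
            DistinctStrings::Short(set) => set.len(),
            DistinctStrings::General(set) => set.len(),
        }
    }

    fn is_short(&self) -> bool {
        matches!(self, DistinctStrings::Short(_))
    }
}
```

For example, inserting the batch `["A", "N", "R", "A"]` keeps the set on the short path with three distinct values; a later batch containing a string longer than 12 bytes triggers the one-time migration. Including the length in the packed value keeps strings such as `"a"` and `"a\0"` distinct.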
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

