Dandandan commented on issue #20773: URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4017058606
> We have tried a similar approach in datafusion before, see https://github.com/apache/datafusion/issues/6937#issuecomment-1681310199 , but found no obvious improvement.

Note that this is significantly different. The algorithm used by DuckDB / in the paper:

* inserts into the hashmap *partition by partition* / *hashmap by hashmap* (I think this is the most important point): this makes the partial aggregation much more cache-efficient, even for aggregations that are not *that* big, since the "in progress" hashmap is more likely to fit in cache and lookups are more likely to hit cache.
* avoids the extra partitioning step (both the double hashing and the copying).
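To make the partition-by-partition idea concrete, here is a minimal, hypothetical Rust sketch (not DataFusion's or DuckDB's actual code): rows are first bucketed by hash into partitions, and then one small per-partition hashmap is built at a time, so the map currently being filled is more likely to stay cache-resident. The function name `partitioned_sum`, the modulo partitioning, and the sum aggregate are all illustrative assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of partition-wise hash aggregation:
/// sum `val` grouped by `key`, building one hashmap per partition at a time.
fn partitioned_sum(rows: &[(u64, i64)], num_partitions: usize) -> HashMap<u64, i64> {
    // Step 1: bucket rows by partition. A real implementation would use the
    // high bits of a proper hash; modulo on the key is enough to illustrate.
    let mut partitions: Vec<Vec<(u64, i64)>> = vec![Vec::new(); num_partitions];
    for &(key, val) in rows {
        partitions[(key as usize) % num_partitions].push((key, val));
    }

    // Step 2: aggregate one partition at a time. Only the current partition's
    // hashmap is "in progress", so it is small and likely to fit in cache.
    let mut result = HashMap::new();
    for part in partitions {
        let mut map: HashMap<u64, i64> = HashMap::new();
        for (key, val) in part {
            *map.entry(key).or_insert(0) += val;
        }
        // Keys never cross partitions, so the final merge is a disjoint extend
        // (no re-hashing of already-aggregated groups into one big map).
        result.extend(map);
    }
    result
}

fn main() {
    let rows = vec![(1, 10), (2, 20), (1, 5), (3, 7), (2, 1)];
    let sums = partitioned_sum(&rows, 4);
    assert_eq!(sums[&1], 15);
    assert_eq!(sums[&2], 21);
    assert_eq!(sums[&3], 7);
    println!("{} groups", sums.len());
}
```

The contrast with the earlier DataFusion experiment is that here partitioning happens *before* any hashmap insertion, and each partition's map is both built and finalized before the next partition is touched, rather than hashing once to partition and again into a single large aggregation map.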
