Dandandan commented on PR #5490: URL: https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1458962589
Nice PR! I think it would be great if we could run some benchmarks to show that we're not regressing too much (e.g. running the TPC-H benchmark queries with joins).

I defaulted to initializing the hash map with the size of the left side for the following reasons:

* The build side (for the partition) already has to be loaded into memory, and it will usually take at least as much memory as the hash table, often more.
* In many cases (e.g. unique identifiers) we need this capacity, and the estimate is optimal.
* Rebuilding the hash table can be slow (although some improvements were made in this area).
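To illustrate the rebuild cost, here is a minimal sketch. It is not the join code from this PR: `std::collections::HashMap` stands in for the join hash table, and `build` and `compare_resizes` are hypothetical helpers. It counts how often the table's capacity changes while inserting unique keys, once with the capacity set to the build-side row count and once starting from zero.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds a join-style table (key -> row indices) and
/// counts how many times the table's capacity changed during the build.
/// Each capacity change means the table was reallocated and its existing
/// entries were rehashed.
fn build(keys: &[u64], initial_capacity: usize) -> (HashMap<u64, Vec<usize>>, usize) {
    let mut table: HashMap<u64, Vec<usize>> = HashMap::with_capacity(initial_capacity);
    let mut resizes = 0;
    let mut last_capacity = table.capacity();

    for (row, key) in keys.iter().enumerate() {
        table.entry(*key).or_default().push(row);
        if table.capacity() != last_capacity {
            resizes += 1;
            last_capacity = table.capacity();
        }
    }
    (table, resizes)
}

/// Hypothetical helper: returns (resizes when pre-sized to the row count,
/// resizes when starting empty) for `n` unique keys.
fn compare_resizes(n: u64) -> (usize, usize) {
    // Unique keys, as in the "unique identifiers" case mentioned above.
    let keys: Vec<u64> = (0..n).collect();
    let (_, presized) = build(&keys, keys.len());
    let (_, grown) = build(&keys, 0);
    (presized, grown)
}
```

With unique keys, the pre-sized table should never resize, because `HashMap::with_capacity` guarantees room for at least that many entries, while the table that starts empty resizes several times as it grows. How much those resizes cost in the real join depends on the hash table implementation, so benchmarks are still needed.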
