alamb commented on issue #5646: URL: https://github.com/apache/arrow-datafusion/issues/5646#issuecomment-1628654865
BTW I think the reason DF's memory usage is increasing with number of cores is because the first partial aggregate phase is using RoundRobin repartitioning (and thus each hash table has an entry for all the groups). To avoid this, we would need to hash repartition the input based on group keys so the different partitions saw different subsets of the group keys -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
