On Tue, 2018-06-05 at 05:42 -0700, Andres Freund wrote:
> > That's an interesting idea, but it seems simpler to stick to hashing
> > rather than using a combination strategy. It also seems like it would
> > take less CPU effort.
>
> Isn't the locality of access going to be considerably better with the
> sort based approach?
I think I see what you mean, but without measuring it's hard to say. You
may be able to achieve similar locality in the hashing approach by using
a smaller hash table -- an effect I think I observed previously, but I'd
have to rerun the numbers.

> > What advantages do you have in mind? My patch partitions the spilled
> > data, so it should have similar disk costs to a sort approach.
>
> I think one part of it is that the amount of code is going to be
> lower - we essentially already have all the code to handle sort-based
> aggs, and to have both sort and hash based aggs in the same query. We'd
> mostly need a way to scan the hashtable and stuff it into a tuplesort;
> that's not hard. nodeAgg.c is already more than complex enough, I'm not
> sure that full-blown partitioning is worth the cost.

The thing I like about your idea is that we wouldn't need to make a
choice at plan time; we'd just always do the hybrid hashing/sorting
unless we know the input is already sorted.

The thing I don't like about it is that it requires running two
memory-hungry operations at once. How much of work_mem do we use for the
sorted runs, and how much do we use for the hash table?

Regards,
	Jeff Davis
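
P.S. To make sure we're picturing the same control flow, here is a toy,
self-contained sketch of the hybrid as I understand it: hash-aggregate
until the (deliberately tiny) table is full, dump the partial groups into
a spill array standing in for a tuplesort, and let a final sort pass
combine the partial results for each key. All of the names and structures
below are made up for illustration only; none of this is nodeAgg.c code
or code from my patch.

/*
 * Toy illustration only -- not nodeAgg.c and not the patch.  Count
 * duplicate int keys with a tiny fixed-size hash table; whenever the
 * table fills up, dump the partial groups into a spill array (standing
 * in for feeding a tuplesort) and reset the table.  A final sort pass
 * combines the partial counts for each key, roughly the way a
 * sort-based agg phase would after the hash phase spills.
 * (Error handling omitted.)
 */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 4				/* deliberately tiny "work_mem" */

typedef struct { int key; long count; int used; } Entry;

static Entry table[NBUCKETS];
static Entry *spill;
static int	nspill, spillcap;

static void
spill_entry(Entry e)			/* stands in for feeding a tuplesort */
{
	if (nspill == spillcap)
	{
		spillcap = spillcap ? spillcap * 2 : 64;
		spill = realloc(spill, spillcap * sizeof(Entry));
	}
	spill[nspill++] = e;
}

static void
flush_table(void)
{
	/* dump every in-memory group as a partial result, then start over */
	for (int i = 0; i < NBUCKETS; i++)
	{
		if (table[i].used)
			spill_entry(table[i]);
		table[i].used = 0;
	}
}

static void
advance(int key)
{
	for (;;)
	{
		for (int i = 0; i < NBUCKETS; i++)
		{
			Entry	   *e = &table[((unsigned) key + i) % NBUCKETS];

			if (e->used && e->key == key)
			{
				e->count++;
				return;
			}
			if (!e->used)
			{
				*e = (Entry) {key, 1, 1};
				return;
			}
		}
		flush_table();			/* table "full": spill partial groups, retry */
	}
}

static int
cmp_key(const void *a, const void *b)
{
	int			ka = ((const Entry *) a)->key, kb = ((const Entry *) b)->key;

	return (ka > kb) - (ka < kb);
}

int
main(void)
{
	int			input[] = {7, 3, 7, 9, 1, 3, 7, 5, 9, 2, 7, 1, 8, 3};
	int			n = sizeof(input) / sizeof(input[0]);

	for (int i = 0; i < n; i++)
		advance(input[i]);
	flush_table();				/* spill whatever is still in memory */

	/* sort the spilled partial groups (the tuplesort's job) ... */
	qsort(spill, nspill, sizeof(Entry), cmp_key);

	/* ... and combine adjacent partials for each key */
	for (int i = 0, j; i < nspill; i = j)
	{
		long		total = 0;

		for (j = i; j < nspill && spill[j].key == spill[i].key; j++)
			total += spill[j].count;
		printf("key %d -> count %ld\n", spill[i].key, total);
	}
	free(spill);
	return 0;
}

Built with a stock C compiler, it prints the combined count per key, e.g.
"key 7 -> count 4" for the sample input, even though key 7's partial
counts end up spread across several spill flushes.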