On Tue, Jul 21, 2020 at 1:30 PM Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> > Maybe I missed your point here. The problem is not so much that we'll
> > get HashAggs that spill -- there is nothing intrinsically wrong with
> > that. While it's true that the I/O pattern is not as sequential as a
> > similar group agg + sort, that doesn't seem like the really important
> > factor here. The really important factor is that in-memory HashAggs
> > can be blazingly fast relative to *any* alternative strategy -- be it
> > a HashAgg that spills, or a group aggregate + sort that doesn't
> > spill, whatever. We're mostly concerned about keeping the one
> > available fast strategy than we are about getting a new, generally
> > slow strategy.
>
> Do we have any data that in-memory HashAggs are "blazingly fast
> relative to *any* alternative strategy?" The data I have tested myself
> and what I saw from Tomas was that spilling sort or spilling hash are
> both 2.5x slower. Are we sure the quoted statement is true?
I admit that I was unclear in the remarks you quote here. I placed too
much emphasis on the precise cross-over point at which a hash agg that
didn't spill in Postgres 12 spills now. That was important to Andres,
who was concerned about the added I/O, especially with things like
cloud providers [1] -- it's not desirable to go from no I/O to lots of
I/O when upgrading, regardless of how fast your disks for temp files
are. But that was not the point I was trying to make (though it's a
good point, and one that I agree with).

I'll take another shot at it. I'll use Andres' test case from [1].
Specifically this query (I went with this example because it was
convenient):

SELECT a, array_agg(b)
FROM (SELECT generate_series(1, 10000)) a(a),
     (SELECT generate_series(1, 10000)) b(b)
GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;

The planner generally prefers a hash agg here, though it's not a
particularly sympathetic case for hash agg. For one thing the input to
the sort is already sorted. For another, there is no skew. But the
planner seems to have it right, at least when everything fits in
memory, because the query takes ~17.6 seconds with a group agg + sort
vs ~13.2 seconds with an in-memory hash agg.

Importantly, hash agg's peak memory usage is 1443558kB (once we get to
the point that no spilling is required), whereas the quicksort uses
7833229kB. Don't forget that in-memory hash agg uses ~5.4x less memory
in this case, on account of the way hash agg represents things. It's
faster, and much, much more efficient once you take a holistic view
(i.e. something like work done per second per KB of memory).

Clearly the precise "query operation spills" cross-over point isn't
that relevant to query execution time (on my server with a fast NVMe
SSD): if I give the sort 95% - 99% of the memory it needs to be an
in-memory quicksort, spilling makes a noticeable difference, but not a
huge one. I get one big run and one tiny run in the tuplesort. The
query itself takes ~23.4 seconds -- higher than 17.6 seconds, but not
all that much higher considering we have to write and read ~7GB of
data. If I try to do approximately the same thing with hash agg (give
it very slightly less than optimal memory) I find that the difference
is smaller -- it takes ~14.1 seconds (up from ~13.2 seconds).

So far it looks like my original remarks were totally wrong, because
the performance hit is entirely explainable as the extra temp file I/O
(right down to the fact that hash agg takes a smaller hit because it
has much less to write out to disk). But let's keep going.

= Sort vs Hash =

We'll focus on how the group agg + sort case behaves as we take memory
away. What I notice is that it literally doesn't matter how much memory
I take away any more (now that the sort has started to spill). I said
that it was ~23.4 seconds when we have two runs, but if I keep taking
memory away so that we get 10 runs it takes 23.2 seconds. If there are
36 runs it takes 22.8 seconds. And if there are 144 runs (work_mem is
50MB, down from the ~7GB "optimum" required for the sort to be
internal) then it takes 21.9 seconds. So it gets slightly faster, not
slower. We really don't need very much memory to merge the sort's runs
in a single pass, and it pretty much doesn't matter how many runs we
need to merge provided the number doesn't get into the thousands --
the point at which random I/O from multiple merge passes finally
starts to bite -- which is quite rare.
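(If anyone wants to reproduce these numbers, it's just a matter of
rerunning the same query at various work_mem settings -- something
like the rough sketch below, though the exact settings are up to you.
You'll need enable_hashagg = off to get the group agg + sort plan,
since the planner prefers hash agg here; trace_sort = on shows the run
counts, while EXPLAIN ANALYZE shows hash agg's batch count and disk
usage directly.)

-- group agg + sort side (vary work_mem between runs):
SET enable_hashagg = off;
SET trace_sort = on;            -- log the number of runs each sort uses
SET client_min_messages = log;  -- make those log lines visible in psql
SET work_mem = '50MB';
EXPLAIN ANALYZE
SELECT a, array_agg(b)
FROM (SELECT generate_series(1, 10000)) a(a),
     (SELECT generate_series(1, 10000)) b(b)
GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;

-- hash agg side: same query again with the sort plan no longer forced;
-- EXPLAIN ANALYZE reports "Batches" and "Disk Usage" for the agg node
RESET enable_hashagg;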
Now for hash agg -- this is where it gets interesting.

If we give it about half the memory it needs (work_mem 700MB) we still
have 4 batches, and it hardly changes -- it takes ~19.8 seconds. That's
slower than the earlier 4-batch case that took ~14.1 seconds, but not
that surprising. 300MB still gets 4 batches, which now takes ~23.5
seconds. 200MB gets 2424 batches and takes ~27.7 seconds -- a big jump!
With 100MB it takes ~31.1 seconds (3340 batches). At 50MB it's ~32.8
seconds (3591 batches). With 5MB it's ~33.8 seconds, and the number of
batches bizarrely *drops*, to only 1028. If I put work_mem down to 1MB
it's ~40.7 seconds and 5604 batches (the number of batches goes up
again). (And yes, the planner still chooses a hash agg when work_mem is
only 1MB.)

= Observations =

Hash aggs that have lots of memory (which could still be somewhat less
than all the memory they could possibly make use of) *are*
significantly faster in general, particularly when you consider memory
efficiency. They tend to use less memory, but to be much more sensitive
to memory availability. And we see really sharp discontinuities at
certain points as memory is taken away: weird behavior around the
number of batches, etc.

Group agg + sort, in contrast, is slower initially/with more memory,
but remarkably insensitive to how much memory it gets, and remarkably
predictable overall (it actually gets a bit faster with less memory
here, but I think it's fair to assume no change once the tuplesort can
merge all the runs produced in one merge pass).

The important point here for me is that a hash agg will almost always
benefit from more memory (until everything fits in a single batch and
we don't spill at all) -- even a small additional amount consistently
makes a difference. Whereas there is a huge "memory availability range"
for the sort where it just does not make a bit of difference. We are
highly incentivized to give hash agg more memory in general, because
it's bound to be faster that way (and usually uses *less* memory, as we
see here). But extra memory doesn't just make hash agg faster -- it
also makes it more predictable and less sensitive to an underestimate
of the number of groupings. Giving hash agg its own, higher memory
limit can therefore ameliorate the problem we have here with users
depending on Postgres 12's fast in-memory hash aggregates, without any
escape hatch kludges.

I don't think that the hash_mem_multiplier patch deserves "extra
credit" for being generally useful. But I also don't think that it
should be punished for it.

[1] https://postgr.es/m/20200625203629.7m6yvut7eqblg...@alap3.anarazel.de

--
Peter Geoghegan