On Tue, Jul 21, 2020 at 1:30 PM Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> > Maybe I missed your point here. The problem is not so much that we'll
> > get HashAggs that spill -- there is nothing intrinsically wrong with
> > that. While it's true that the I/O pattern is not as sequential as a
> > similar group agg + sort, that doesn't seem like the really important
> > factor here. The really important factor is that in-memory HashAggs
> > can be blazingly fast relative to *any* alternative strategy -- be it
> > a HashAgg that spills, or a group aggregate + sort that doesn't
> > spill, whatever. We're mostly concerned about keeping the one
> > available fast strategy than we are about getting a new, generally
> > slow strategy.
>
> Do we have any data that in-memory HashAggs are "blazingly fast
> relative to *any* alternative strategy?" The data I have tested myself
> and what I saw from Tomas was that spilling sort or spilling hash are
> both 2.5x slower. Are we sure the quoted statement is true?
I admit that I was unclear in the remarks you quote here. I placed too
much emphasis on the precise cross-over point at which a hash agg that
didn't spill in Postgres 12 spills now. That was important to Andres,
who was concerned about the added I/O, especially with things like
cloud providers [1] -- it's not desirable to go from no I/O to lots of
I/O when upgrading, regardless of how fast your disks for temp files
are. But that was not the point I was trying to make (though it's a
good point, and one that I agree with).

I'll take another shot at it. I'll use Andres' test case from [1].
Specifically this query (I went with this example because it was
convenient):

SELECT a, array_agg(b)
FROM (SELECT generate_series(1, 10000)) a(a),
     (SELECT generate_series(1, 10000)) b(b)
GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;

The planner generally prefers a hash agg here, though it's not a
particularly sympathetic case for hash agg. For one thing the input to
the sort is already sorted. For another, there is no skew. But the
planner seems to have it right, at least when everything fits in
memory, because the query takes ~17.6 seconds with a group agg + sort
vs ~13.2 seconds with an in-memory hash agg.

Importantly, hash agg's peak memory usage is 1443558kB (once we get to
the point that no spilling is required), whereas the quicksort uses
7833229kB. Don't forget that in-memory hash agg uses ~5.4x less memory
in this case, on account of the way hash agg represents things. It's
faster, and much, much more efficient once you take a holistic view
(i.e. something like work done per second per KB of memory).

Clearly the precise "query operation spills" cross-over point isn't
that relevant to query execution time (on my server with a fast NVMe
SSD): if I give the sort 95% - 99% of the memory it needs to be an
in-memory quicksort, spilling makes a noticeable difference, but not a
huge one. I get one big run and one tiny run in the tuplesort. The
query itself takes ~23.4 seconds -- higher than 17.6 seconds, but not
all that much higher considering we have to write and read ~7GB of
data. If I try to do approximately the same thing with hash agg (give
it very slightly less than optimal memory) I find that the difference
is smaller -- it takes ~14.1 seconds (up from ~13.2 seconds).

So far it looks like my original remarks were totally wrong, because
the performance hit is entirely explainable as the extra temp file I/O
(right down to the fact that hash agg takes a smaller hit because it
has much less to write out to disk). But let's keep going.

= Sort vs Hash =

We'll focus on how the group agg + sort case behaves as we take memory
away. What I notice is that it literally doesn't matter how much memory
I take away any more (now that the sort has started to spill). I said
that it was ~23.4 seconds when we have two runs, but if I keep taking
memory away so that we get 10 runs it takes 23.2 seconds. If there are
36 runs it takes 22.8 seconds. And if there are 144 runs (work_mem is
50MB, down from the ~7GB "optimum" required for the sort to be
internal) then it takes 21.9 seconds. So it gets slightly faster, not
slower. We really don't need very much memory to merge the sort's runs
in a single pass, and it pretty much doesn't matter how many runs we
need to merge provided the number doesn't get into the thousands --
the point at which random I/O from multiple merge passes finally
starts to bite -- which is quite rare.
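(If anyone wants to reproduce these numbers, it's just a matter of
rerunning the same query at various work_mem settings -- something
like the rough sketch below, though the exact settings are up to you.
You'll need enable_hashagg = off to get the group agg + sort plan,
since the planner prefers hash agg here; trace_sort = on shows the run
counts, while EXPLAIN ANALYZE shows hash agg's batch count and disk
usage directly.)

-- group agg + sort side (vary work_mem between runs):
SET enable_hashagg = off;
SET trace_sort = on;            -- log the number of runs each sort uses
SET client_min_messages = log;  -- make those log lines visible in psql
SET work_mem = '50MB';
EXPLAIN ANALYZE
SELECT a, array_agg(b)
FROM (SELECT generate_series(1, 10000)) a(a),
     (SELECT generate_series(1, 10000)) b(b)
GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;

-- hash agg side: same query again with the sort plan no longer forced;
-- EXPLAIN ANALYZE reports "Batches" and "Disk Usage" for the agg node
RESET enable_hashagg;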
Now for hash agg -- this is where it gets interesting.

If we give it about half the memory it needs (work_mem 700MB) we still
have 4 batches, and it hardly changes -- it takes ~19.8 seconds. That's
slower than the earlier 4-batch case that took ~14.1 seconds, but not
that surprising. 300MB still gets 4 batches, which now takes ~23.5
seconds. 200MB gets 2424 batches and takes ~27.7 seconds -- a big jump!
With 100MB it takes ~31.1 seconds (3340 batches). At 50MB it's ~32.8
seconds (3591 batches). With 5MB it's ~33.8 seconds, and the number of
batches bizarrely *drops*, to only 1028. If I put work_mem down to 1MB
it's ~40.7 seconds and 5604 batches (the number of batches goes up
again). (And yes, the planner still chooses a hash agg when work_mem is
only 1MB.)

= Observations =

Hash aggs that have lots of memory (which could still be somewhat less
than all the memory they could possibly make use of) *are*
significantly faster in general, particularly when you consider memory
efficiency. They tend to use less memory, but to be much more sensitive
to memory availability. And we see really sharp discontinuities at
certain points as memory is taken away: weird behavior around the
number of batches, etc.

Group agg + sort, in contrast, is slower initially/with more memory,
but remarkably insensitive to how much memory it gets, and remarkably
predictable overall (it actually gets a bit faster with less memory
here, but I think it's fair to assume no change once the tuplesort can
merge all the runs produced in one merge pass).

The important point here for me is that a hash agg will almost always
benefit from more memory (until everything fits in a single batch and
we don't spill at all) -- even a small additional amount consistently
makes a difference. Whereas there is a huge "memory availability range"
for the sort where it just does not make a bit of difference. We are
highly incentivized to give hash agg more memory in general, because
it's bound to be faster that way (and usually uses *less* memory, as we
see here). But extra memory doesn't just make hash agg faster -- it
also makes it more predictable and less sensitive to an underestimate
of the number of groupings. Giving hash agg its own, higher memory
limit can therefore ameliorate the problem we have here with users
depending on Postgres 12's fast in-memory hash aggregates, without any
escape hatch kludges.

I don't think that the hash_mem_multiplier patch deserves "extra
credit" for being generally useful. But I also don't think that it
should be punished for it.

[1] https://postgr.es/m/20200625203629.7m6yvut7eqblg...@alap3.anarazel.de

--
Peter Geoghegan