Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost <sfr...@snowman.net> wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > Have you got a better proposal that is reasonably implementable for v13?
> > > (I do not accept the argument that "do nothing" is a better proposal.)
> >
> > So, no, I don't agree that 'do nothing' (except ripping out the one GUC
> > that was already added) is a worse proposal than adding another
> > work_mem-like thing that's only for some node types.
>
> The question was "Have you got a better proposal that is reasonably
> implementable for v13?".
>
> This is anecdotal, but just today somebody on Twitter reported
> *increasing* work_mem to stop getting OOMs from group aggregate +
> sort:
>
> https://twitter.com/theDressler/status/1281942941133615104
Yes, increasing work_mem isn't unusual, at all.

What that tweet shows, and what I don't think folks who are suggesting
things like setting this factor to 2.0 appreciate, is that people may have
work_mem configured in the gigabytes- meaning that with work_mem at 5GB, a
2.0 value would result in a hash_mem of 10GB. Now, I'm all for telling
people to review their configurations between major versions, but that's a
large difference that's going to be pretty deeply hidden in a 'multiplier'
setting.

I'm still wholly unconvinced that we need such a setting, just to be
clear, but I don't think there's any way it'd be reasonable to have it set
to something other than "whatever work_mem is" by default- and it needs to
actually be "what work_mem is" and not "have the same default value", or
*everyone* would have to configure it.

> It was possible to fix the problem in this instance, since evidently
> there wasn't anything else that really did try to consume ~5 GB of
> work_mem memory. Evidently the memory isn't available in any general
> sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
> this server just as soon as there is a real need to do a ~5GB sort,
> regardless of anything else.

Eh? That's not at all what it looks like- they were getting OOMs because
they set work_mem higher than the actual amount of memory they had, and
the Sort before the GroupAgg was actually trying to use all that memory.
The HashAgg ended up not needing that much memory because the aggregated
set wasn't actually that large.

If anything, this shows exactly what Jeff's fine work here is (hopefully)
going to give us- the option to plan a HashAgg in such cases, since we can
accept spilling to disk if we end up underestimating, or take advantage of
that HashAgg being entirely in memory if we overestimate.

> I don't think that this kind of perverse effect is uncommon.
> Hash aggregate can naturally be far faster than group agg + sort, Hash
> agg can naturally use a lot less memory in many cases, and we have every
> reason to think that grouping estimates are regularly totally wrong.

I'm confused as to why we're talking about the relative performance of a
HashAgg vs. a Sort+GroupAgg- of course the HashAgg is going to be faster
if it's got enough memory, but that's a constraint we have to consider and
deal with. Otherwise, the query can end up failing and potentially
impacting other queries or activity on the system- up to and including the
entire database system falling over because the OOM killer fires and kills
a process, forcing the database to restart and go through crash recovery,
which is going to be quite a bit worse than performance maybe not being
great.

> You're significantly underestimating the risk.

Of... what? That we'll end up getting worse performance because we
underestimated the size of the result set and we end up spilling to disk
with the HashAgg? I think I'm giving that risk the amount of concern it
deserves- which is, frankly, not very much. Users who run into that issue,
as this tweet *also* showed, are familiar with work_mem and can tune it to
address it.

This reaction of demanding a new GUC to break up work_mem into pieces
strikes me as unjustified, and doing so during beta makes it that much
worse.

Having looked back, I'm not sure that I'm really in the minority regarding
the proposal to add this at this time either- there have been a few
different comments that it's too late for v13, and/or that we should see
if users actually end up seriously complaining about the lack of a
separate way to specify the memory for a given node type, and/or that if
we're going to do this then we should have a broader set of options
covering other node types too- all positions that I agree with.

Thanks,

Stephen
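P.S. To make the multiplier arithmetic above concrete, here is a
hypothetical postgresql.conf fragment- the values are illustrative of how
the proposed multiplier would scale with an already-large work_mem, not a
recommendation:

```ini
# Illustrative only: with work_mem already in the gigabytes, a 2.0
# multiplier silently doubles the per-node budget for hash-based nodes.
work_mem = '5GB'            # sort nodes etc. may each use up to 5GB
hash_mem_multiplier = 2.0   # hash-based nodes may each use up to 10GB
```

The point being that the 10GB figure never appears anywhere in the
configuration file- it only exists as the product of two settings.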