On Fri, Jun 26, 2020 at 4:00 PM Bruce Momjian <br...@momjian.us> wrote:
> Imagine we set the cluster-wide total of work_mem to 1GB. If a session
> asks for 100MB, if there are no other active sessions, it can grant the
> entire 100MB. If there are other sessions running, and 500MB has
> already been allocated, maybe it is only given an active per-node
> work_mem of 50MB. As the amount of unallocated cluster-wide work_mem
> gets smaller, requests are granted smaller actual allocations.
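
To make the kind of grant logic you're describing concrete, here is a
minimal sketch in C. Everything in it is made up for illustration (the
WorkMemPool struct, the proportional scaling rule, the workmem_grant()
function); none of it is actual Postgres code, and a real allocator
would also need locking and a way to give memory back when nodes finish:

#include <stdio.h>

typedef struct WorkMemPool
{
    size_t      total;          /* cluster-wide budget, e.g. 1GB */
    size_t      allocated;      /* memory already granted to running nodes */
} WorkMemPool;

/*
 * Grant a node's work_mem request, scaled down by the fraction of the
 * cluster-wide budget that is still unallocated.
 */
static size_t
workmem_grant(WorkMemPool *pool, size_t request)
{
    size_t      unallocated = pool->total - pool->allocated;
    double      scale = (double) unallocated / (double) pool->total;
    size_t      grant = (size_t) (request * scale);

    if (grant > unallocated)
        grant = unallocated;
    pool->allocated += grant;
    return grant;
}

int
main(void)
{
    WorkMemPool pool = {1024UL * 1024 * 1024, 0};   /* 1GB total */
    size_t      mb = 1024UL * 1024;

    /* No other sessions: the 100MB request is granted in full. */
    printf("grant: %zu MB\n", workmem_grant(&pool, 100 * mb) / mb);

    /* With 500MB already out, the same request is scaled down to ~51MB. */
    pool.allocated = 500 * mb;
    printf("grant: %zu MB\n", workmem_grant(&pool, 100 * mb) / mb);

    return 0;
}

With a 1GB budget this grants the full 100MB to the first request and
only ~51MB once 500MB is already spoken for, much like the numbers in
the example above.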
I think that that's the right approach long term. But right now the DBA
has no way to give hash-based nodes more memory, even though it's clear
that that's where it's truly needed in most cases, across almost all
workloads. I think that that's the really glaring problem. This is just
the intrinsic nature of hash-based aggregation and hash join vs
sort-based aggregation and merge join (roughly speaking). It's much more
valuable to be able to do hash-based aggregation in one pass, especially
in cases where hashing already did particularly well in Postgres v12.

> What we do now makes little sense, because we might have lots of free
> memory, but we force nodes to spill to disk when they exceed a fixed
> work_mem. I realize this is very imprecise, because you don't know what
> future work_mem requests are coming, or how long until existing
> allocations are freed, but it seems it would have to be better than what
> we do now.

Postgres 13 made hash aggregate respect work_mem. Perhaps it would have
made more sense to teach work_mem to respect hash aggregate, though.
Hash aggregate cannot consume an unbounded amount of memory in v13,
since the old behavior was clearly unreasonable. Which is great. But it
may be even more unreasonable to force users to conservatively set the
limit on the size of the hash table in an artificial, generic way.

> Since work_mem affects the optimizer choices, I can imagine it getting
> complex since nodes would have to ask the global work_mem allocator how
> much memory it _might_ get, but then ask for final work_mem during
> execution, and they might differ. Still, our spill costs are so high
> for so many node types, that reducing spills seems like it would be a
> win, even if it sometimes causes poorer plans.

I don't think it's really about the spill costs, at least in one
important sense. If performing a hash aggregate in memory uses twice as
much memory as spilling (with either sorting or hashing), but the
operation completes in one third the time, you have actually saved
memory in the aggregate (no pun intended): twice the memory held for a
third of the time works out to only two thirds of the memory-time
product. Also, the query is 3x faster, which is a nice bonus! I don't
think that this kind of scenario is rare.

--
Peter Geoghegan