On Fri, Jun 26, 2020 at 4:00 PM Bruce Momjian <br...@momjian.us> wrote:
> Imagine we set the cluster-wide total of work_mem to 1GB. If a session
> asks for 100MB, if there are no other active sessions, it can grant the
> entire 100MB. If there are other sessions running, and 500MB has
> already been allocated, maybe it is only given an active per-node
> work_mem of 50MB. As the amount of unallocated cluster-wide work_mem
> gets smaller, requests are granted smaller actual allocations.
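
To make the kind of grant logic you're describing concrete, here is a
minimal sketch in C. Everything in it is made up for illustration (the
WorkMemPool struct, the proportional scaling rule, the workmem_grant()
function); none of it is actual Postgres code, and a real allocator
would also need locking and a way to give memory back when nodes finish:

#include <stdio.h>

typedef struct WorkMemPool
{
    size_t      total;          /* cluster-wide budget, e.g. 1GB */
    size_t      allocated;      /* memory already granted to running nodes */
} WorkMemPool;

/*
 * Grant a node's work_mem request, scaled down by the fraction of the
 * cluster-wide budget that is still unallocated.
 */
static size_t
workmem_grant(WorkMemPool *pool, size_t request)
{
    size_t      unallocated = pool->total - pool->allocated;
    double      scale = (double) unallocated / (double) pool->total;
    size_t      grant = (size_t) (request * scale);

    if (grant > unallocated)
        grant = unallocated;
    pool->allocated += grant;
    return grant;
}

int
main(void)
{
    WorkMemPool pool = {1024UL * 1024 * 1024, 0};   /* 1GB total */
    size_t      mb = 1024UL * 1024;

    /* No other sessions: the 100MB request is granted in full. */
    printf("grant: %zu MB\n", workmem_grant(&pool, 100 * mb) / mb);

    /* With 500MB already out, the same request is scaled down to ~51MB. */
    pool.allocated = 500 * mb;
    printf("grant: %zu MB\n", workmem_grant(&pool, 100 * mb) / mb);

    return 0;
}

With a 1GB budget this grants the full 100MB to the first request and
only ~51MB once 500MB is already spoken for, much like the numbers in
the example above.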
I think that that's the right approach long term. But right now the DBA
has no way to give hash-based nodes more memory, even though it's clear
that that's where it's truly needed in most cases, across almost all
workloads. I think that that's the really glaring problem. This is just
the intrinsic nature of hash-based aggregation and hash join vs
sort-based aggregation and merge join (roughly speaking). It's much more
valuable to be able to do hash-based aggregation in one pass, especially
in cases where hashing already did particularly well in Postgres v12.

> What we do now makes little sense, because we might have lots of free
> memory, but we force nodes to spill to disk when they exceed a fixed
> work_mem. I realize this is very imprecise, because you don't know what
> future work_mem requests are coming, or how long until existing
> allocations are freed, but it seems it would have to be better than what
> we do now.

Postgres 13 made hash aggregate respect work_mem. Perhaps it would have
made more sense to teach work_mem to respect hash aggregate, though.
Hash aggregate cannot consume an unbounded amount of memory in v13,
since the old behavior was clearly unreasonable. Which is great. But it
may be even more unreasonable to force users to conservatively set the
limit on the size of the hash table in an artificial, generic way.

> Since work_mem affects the optimizer choices, I can imagine it getting
> complex since nodes would have to ask the global work_mem allocator how
> much memory it _might_ get, but then ask for final work_mem during
> execution, and they might differ. Still, our spill costs are so high
> for so many node types, that reducing spills seems like it would be a
> win, even if it sometimes causes poorer plans.

I don't think it's really about the spill costs, at least in one
important sense. If performing a hash aggregate in memory uses twice as
much memory as spilling (with either sorting or hashing), but the
operation completes in one third the time, you have actually saved
memory in the aggregate (no pun intended): twice the memory held for a
third of the time works out to only two thirds of the memory-time
product. Also, the query is 3x faster, which is a nice bonus! I don't
think that this kind of scenario is rare.

--
Peter Geoghegan