On 06/11/2018 05:16 PM, Robert Haas wrote:
On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those cases behaves nicely/terribly in different
corner cases.

That's a really good point.  If the number of groups is pretty small
compared to the number of input tuples, then you really only ever want
to dump out transition values.  By doing so, you minimize the amount
of data you have to write.  But if the number of groups is, say, half
the number of input tuples, then computing transition values before
you have all the values that belong to that group is probably a waste
of time.  I wonder if we could decide what to do by comparing the
number of input tuples consumed to the number of groups created at the
time we run out of memory.  If we've got less than, I dunno, five
tuples per group, then assume most groups are small.  Pick a strategy
that (mainly) spools input tuples.  Otherwise, pick a strategy that
spools transition tuples.
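
Just to make that concrete, here is a minimal sketch of what the decision
could look like at spill time, assuming we already track both counters while
filling the hash table (the cutoff of 5 and all of the names are made up for
illustration, not actual executor code):

typedef enum SpillStrategy
{
    SPILL_INPUT_TUPLES,         /* few rows per group: spool raw input tuples */
    SPILL_TRANSITION_TUPLES     /* many rows per group: spool transition values */
} SpillStrategy;

/*
 * Decide how to spill once the hash table exceeds the memory limit, based on
 * how many input rows it took to create the groups we have seen so far.
 */
static SpillStrategy
choose_spill_strategy(double input_tuples_consumed, double groups_created)
{
    double rows_per_group;

    if (groups_created <= 0)
        return SPILL_INPUT_TUPLES;

    rows_per_group = input_tuples_consumed / groups_created;

    return (rows_per_group < 5.0) ? SPILL_INPUT_TUPLES
                                  : SPILL_TRANSITION_TUPLES;
}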

In either case, it seems like we can pick a pure hashing strategy or
switch to using both hashing and sorting.  For example, IIUC, Andres's
proposal involves spooling mostly input tuples, but can also spool
transition tuples if the transition values consume more memory as they
absorb more tuples.  However, you could do a similar kind of thing
without needing sort support.  When you see a value that doesn't
fit in your in-memory hash table, use the hash code to route it to 1
of N batch files.  Have a second set of batch files for transition
tuples in case you need to kick things out of the in-memory hash
table.  Once you finish reading the input, emit all the values that
remain in the in-memory hash table and then process each batch file
separately.
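
A toy standalone sketch of that routing, using plain stdio in place of
whatever the executor would actually use for batch files (every name here is
invented for illustration):

#include <stdint.h>
#include <stdio.h>

#define NUM_BATCHES 32

/* One spill file per batch for raw input tuples, plus a second set for
 * transition states that later get kicked out of the in-memory table. */
static FILE *input_batches[NUM_BATCHES];
static FILE *transition_batches[NUM_BATCHES];

static void
init_batches(void)
{
    for (int i = 0; i < NUM_BATCHES; i++)
    {
        input_batches[i] = tmpfile();
        transition_batches[i] = tmpfile();
    }
}

/* The same hash code that indexes the in-memory hash table also picks the
 * batch, so every row of a given group ends up in the same batch file. */
static int
batch_no(uint32_t hashvalue)
{
    return (int) (hashvalue % NUM_BATCHES);
}

/* Called for an input tuple whose group is not in the in-memory table and
 * for which there is no room to create a new entry. */
static void
spill_input_tuple(uint32_t hashvalue, const void *tuple, size_t len)
{
    FILE *f = input_batches[batch_no(hashvalue)];

    fwrite(&len, sizeof(len), 1, f);
    fwrite(tuple, len, 1, f);
}

Because the batch is chosen by the same hash code that indexes the in-memory
table, all rows of a given group land in the same batch file, so each batch
can later be processed independently.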

Similarly, David's strategy involves spooling only transition tuples
and then sorting on the group key, but it's again possible to do
something similar without relying on sorting.  Instead of flushing the
in-memory hash table to a tuple store, split the transition tuples it
contains among N batch files based on the hash code.  Once you've read
all of the input, go back and reprocess each batch file, combining
transition values whenever the same group keys appear in more than one
transition tuple.
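
And a similarly hedged toy for the reprocessing step, hard-coding a
SUM-over-int64 transition state so the combine step stays visible (again,
none of these names or structures are real executor code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy transition tuple: a group key plus a partial SUM state.  A real
 * implementation would serialize arbitrary grouping keys and whatever the
 * aggregate's transition type happens to be. */
typedef struct SpilledState
{
    int64_t key;
    int64_t partial_sum;
} SpilledState;

/* Tiny fixed-size open-addressing table standing in for the real hash table. */
#define TABLE_SIZE 4096

typedef struct Entry
{
    bool    used;
    int64_t key;
    int64_t sum;
} Entry;

static Entry groups[TABLE_SIZE];

static Entry *
lookup_or_insert(int64_t key)
{
    uint64_t h = (uint64_t) key * UINT64_C(0x9E3779B97F4A7C15);

    for (int i = 0; i < TABLE_SIZE; i++)
    {
        Entry *e = &groups[(h + i) % TABLE_SIZE];

        if (!e->used)
        {
            e->used = true;
            e->key = key;
            e->sum = 0;
            return e;
        }
        if (e->key == key)
            return e;
    }
    return NULL;    /* table full; a real implementation would grow or spill */
}

/*
 * Reprocess one batch file of spilled transition tuples: rebuild the table
 * for just this batch, combining partial states whenever the same key shows
 * up more than once, then emit the finished groups.
 */
static void
process_transition_batch(FILE *batch)
{
    SpilledState st;

    memset(groups, 0, sizeof(groups));
    rewind(batch);

    while (fread(&st, sizeof(st), 1, batch) == 1)
    {
        Entry *e = lookup_or_insert(st.key);

        if (e != NULL)
            e->sum += st.partial_sum;   /* the "combine" step, here for SUM */
    }

    for (int i = 0; i < TABLE_SIZE; i++)
    {
        if (groups[i].used)
            printf("key=%lld  sum=%lld\n",
                   (long long) groups[i].key, (long long) groups[i].sum);
    }
}

The assumption baked in here is that the aggregate has a combine function;
that is what makes merging partial states possible regardless of how a
group's rows were split between memory and the batch files.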


Yeah, essentially something like the batching in hash joins.

To me, the pure-hashing strategies look a little simpler, but maybe
there's enough performance benefit from combining hashing and sorting
that it's worth the complexity, or maybe we should just accept
whichever variant somebody's willing to code.  But I think we almost
have to have separate handling for many-row-per-group and
few-rows-per-group, because those seem fundamentally different.


I think the underlying question is whether we want to treat this as an emergency safeguard against OOM (which should be a rare thing in most cases), or as something that allows us to pick hash aggregate more often.

It would be great to get something that performs better than just falling back to sort (and I was advocating for that), but I'm worried we might be moving the goalposts way too far.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
