On Thu, Aug 14, 2014 at 2:21 PM, Jeff Davis <pg...@j-davis.com> wrote:
> On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
>> Oh? So if we have aggregates like array_agg whose memory footprint
>> increases over time, the patch completely fails to avoid bloat?
>
> Yes, in its current form.
>
>> I might think a patch with such a limitation was useful, if it weren't
>> for the fact that aggregates of that nature are a large part of the
>> cases where the planner misestimates the table size in the first place.
>> Any complication that we add to nodeAgg should be directed towards
>> dealing with cases that the planner is likely to get wrong.
>
> In my experience, the planner has a lot of difficulty estimating the
> cardinality unless it's coming from a base table without any operators
> above it (other than maybe a simple predicate). This is probably a lot
> more common than array_agg problems, simply because array_agg is
> relatively rare compared with GROUP BY in general.
I think that's right, and I rather like your (Jeff's) approach. It's definitely true that we could do better if we have a mechanism for serializing and deserializing group states, but (1) I think an awful lot of cases would get an awful lot better even just with the approach proposed here and (2) I doubt we would make the serialization/deserialization interfaces mandatory, so even if we had that we'd probably want a fallback strategy anyway.

Furthermore, I don't really see that we're backing ourselves into a corner here. If prohibiting creation of additional groups isn't sufficient to control memory usage, but we have serialization/deserialization functions, we can just pick an arbitrary subset of the groups that we're storing in memory and spool their transition states off to disk, thus reducing memory even further.

I understand Tomas' point to be that this is quite different from what we do for hash joins, but I think it's a different problem. In the case of a hash join, there are two streams of input tuples, and we've got to batch them in compatible ways. If we were to, say, exclude an arbitrary subset of tuples from the hash table instead of basing it on the hash code, we'd have to test *every* outer tuple against the hash table for *every* batch. That would incur a huge amount of additional cost vs. being able to discard outer tuples once the batch to which they pertain has been processed. But the situation here isn't comparable, because there's only one input stream.

I'm pretty sure we'll want to keep track of which transition states we've spilled due to lack of memory as opposed to those which were never present in the table at all, so that we can segregate the unprocessed tuples that pertain to spilled transition states from the ones that pertain to a group we haven't begun yet.
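[Editorial sketch, not part of the original mail.] The scheme under discussion (admit new groups until a memory cap is hit, spool tuples for not-yet-admitted groups, then make further passes over the spilled tuples) can be sketched in a few lines of illustrative Python. This is not nodeAgg code; the function and parameter names (aggregate_with_cap, hash_aggregate, max_groups) are invented for illustration, and the "transition state" here is just a running sum:

```python
def aggregate_with_cap(tuples, max_groups):
    """One pass: aggregate (key, value) pairs, admitting at most
    max_groups distinct keys; tuples for any other key are spilled
    (in real life, spooled to disk) instead of creating a new group."""
    states = {}   # in-memory transition states: key -> running sum
    spilled = []  # tuples for groups we refused to create this pass
    for key, value in tuples:
        if key in states:
            states[key] += value          # advance an existing transition state
        elif len(states) < max_groups:
            states[key] = value           # room left: admit a new group
        else:
            spilled.append((key, value))  # no room: spool for a later pass
    return states, spilled

def hash_aggregate(tuples, max_groups):
    """Repeat capped passes over the spill until every group is done."""
    results = {}
    remaining = list(tuples)
    while remaining:
        states, remaining = aggregate_with_cap(remaining, max_groups)
        results.update(states)
    return results
```

Note that each pass finishes some disjoint set of groups, so no tuple ever has to be retested against already-completed groups, which is the asymmetry with hash joins that the paragraph above points out.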
And it might be that if we know (or learn as we go along) that we're going to vastly blow out work_mem, it makes sense to use batching to segregate the tuples that we decide not to process onto N tapes binned by hash code, so that we have a better chance that future batches will be the right size to fit in memory. But I'm not convinced that there's a compelling reason why the *first* batch has to be chosen by hash code; we're actually best off picking any arbitrary set of groups that does the best job reducing the amount of data remaining to be processed, at least if the transition states are fixed size and maybe even if they aren't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
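[Editorial sketch, not part of the original mail.] The N-tapes idea in the message above can be sketched as follows; again this is illustrative Python with invented names (partition_spill, n_tapes), not executor code. Binning by hash code guarantees that all tuples for a given key land on the same tape, so each later batch covers a disjoint slice of the key space:

```python
def partition_spill(spilled, n_tapes):
    """Route each spilled (key, value) tuple to one of n_tapes bins,
    chosen by the key's hash, so equal keys share a tape and each
    tape can later be aggregated independently within work_mem."""
    tapes = [[] for _ in range(n_tapes)]
    for key, value in spilled:
        tapes[hash(key) % n_tapes].append((key, value))
    return tapes
```

Each tape can then be fed back through the capped aggregation pass on its own; no cross-tape work is needed because the key spaces are disjoint.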