On Thu, Aug 14, 2014 at 2:21 PM, Jeff Davis <pg...@j-davis.com> wrote:
> On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
>> Oh? So if we have aggregates like array_agg whose memory footprint
>> increases over time, the patch completely fails to avoid bloat?
>
> Yes, in its current form.
>
>> I might think a patch with such a limitation was useful, if it weren't
>> for the fact that aggregates of that nature are a large part of the
>> cases where the planner misestimates the table size in the first place.
>> Any complication that we add to nodeAgg should be directed towards
>> dealing with cases that the planner is likely to get wrong.
>
> In my experience, the planner has a lot of difficulty estimating the
> cardinality unless it's coming from a base table without any operators
> above it (other than maybe a simple predicate). This is probably a lot
> more common than array_agg problems, simply because array_agg is
> relatively rare compared with GROUP BY in general.
I think that's right, and I rather like your (Jeff's) approach. It's definitely true that we could do better if we have a mechanism for serializing and deserializing group states, but (1) I think an awful lot of cases would get an awful lot better even just with the approach proposed here and (2) I doubt we would make the serialization/deserialization interfaces mandatory, so even if we had that we'd probably want a fallback strategy anyway.

Furthermore, I don't really see that we're backing ourselves into a corner here. If prohibiting creation of additional groups isn't sufficient to control memory usage, but we have serialization/deserialization functions, we can just pick an arbitrary subset of the groups that we're storing in memory and spool their transition states off to disk, thus reducing memory even further.

I understand Tomas' point to be that this is quite different from what we do for hash joins, but I think it's a different problem. In the case of a hash join, there are two streams of input tuples, and we've got to batch them in compatible ways. If we were to, say, exclude an arbitrary subset of tuples from the hash table instead of basing it on the hash code, we'd have to test *every* outer tuple against the hash table for *every* batch. That would incur a huge amount of additional cost vs. being able to discard outer tuples once the batch to which they pertain has been processed. But the situation here isn't comparable, because there's only one input stream.

I'm pretty sure we'll want to keep track of which transition states we've spilled due to lack of memory as opposed to those which were never present in the table at all, so that we can segregate the unprocessed tuples that pertain to spilled transition states from the ones that pertain to a group we haven't begun yet.
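[Editorial sketch, not part of the original mail.] The scheme under discussion (admit new groups until a memory cap is hit, spool tuples for not-yet-admitted groups, then make further passes over the spilled tuples) can be sketched in a few lines of illustrative Python. This is not nodeAgg code; the function and parameter names (aggregate_with_cap, hash_aggregate, max_groups) are invented for illustration, and the "transition state" here is just a running sum:

```python
def aggregate_with_cap(tuples, max_groups):
    """One pass: aggregate (key, value) pairs, admitting at most
    max_groups distinct keys; tuples for any other key are spilled
    (in real life, spooled to disk) instead of creating a new group."""
    states = {}   # in-memory transition states: key -> running sum
    spilled = []  # tuples for groups we refused to create this pass
    for key, value in tuples:
        if key in states:
            states[key] += value          # advance an existing transition state
        elif len(states) < max_groups:
            states[key] = value           # room left: admit a new group
        else:
            spilled.append((key, value))  # no room: spool for a later pass
    return states, spilled

def hash_aggregate(tuples, max_groups):
    """Repeat capped passes over the spill until every group is done."""
    results = {}
    remaining = list(tuples)
    while remaining:
        states, remaining = aggregate_with_cap(remaining, max_groups)
        results.update(states)
    return results
```

Note that each pass finishes some disjoint set of groups, so no tuple ever has to be retested against already-completed groups, which is the asymmetry with hash joins that the paragraph above points out.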
And it might be that if we know (or learn as we go along) that we're going to vastly blow out work_mem, it makes sense to use batching to segregate the tuples that we decide not to process onto N tapes binned by hash code, so that we have a better chance that future batches will be the right size to fit in memory. But I'm not convinced that there's a compelling reason why the *first* batch has to be chosen by hash code; we're actually best off picking any arbitrary set of groups that does the best job reducing the amount of data remaining to be processed, at least if the transition states are fixed size and maybe even if they aren't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
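[Editorial sketch, not part of the original mail.] The N-tapes idea in the message above can be sketched as follows; again this is illustrative Python with invented names (partition_spill, n_tapes), not executor code. Binning by hash code guarantees that all tuples for a given key land on the same tape, so each later batch covers a disjoint slice of the key space:

```python
def partition_spill(spilled, n_tapes):
    """Route each spilled (key, value) tuple to one of n_tapes bins,
    chosen by the key's hash, so equal keys share a tape and each
    tape can later be aggregated independently within work_mem."""
    tapes = [[] for _ in range(n_tapes)]
    for key, value in spilled:
        tapes[hash(key) % n_tapes].append((key, value))
    return tapes
```

Each tape can then be fed back through the capped aggregation pass on its own; no cross-tape work is needed because the key spaces are disjoint.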