On 12 August 2014, 7:06, Jeff Davis wrote:
> On Mon, 2014-08-11 at 01:29 +0200, Tomas Vondra wrote:
>> On 10.8.2014 23:26, Jeff Davis wrote:
>> > This patch is requires the Memory Accounting patch, or something
>> > similar to track memory usage.
>> I think the patch you sent actually includes the accounting patch. Is
>> that on purpose, or by accident?
> Accident, thank you.
>> So once a group gets into memory, it stays there? That's going to work
>> fine for aggregates with fixed-size state (int4, or generally state that
>> gets allocated and does not grow), but I'm afraid for aggregates with
>> growing state (as for example array_agg and similar) that's not really a
>> solution.
> I agree in theory, but for now I'm just not handling that case at all
> because there is other work that needs to be done first. For one thing,
> we would need a way to save the transition state, and we don't really
> have that. In the case of array_agg, the state is not serialized and
> there's no generic way to ask it to serialize itself without finalizing.

Yes and no.

It's true we don't have this ability for aggregates passing state using
'internal', and arguably these are the cases that matter (because those
are the states that tend to "bloat" as more values are passed to the
aggregate).

We can do that for states with a known type (because we have serialize/
deserialize methods for them), but we can't really require all aggregates
to use only known types. The 'internal' type is there for a reason.

So I think eventually we should support something like this:

CREATE AGGREGATE myaggregate (
    ...,
    SERIALIZE_FUNC = 'dump_data',
    DESERIALIZE_FUNC = 'read_data'
);
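Just to illustrate what I mean by those two functions - a standalone C
sketch for an array_agg-like state (the names, struct layout and
signatures here are entirely made up; the real functions would
presumably flatten the state into a bytea):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical growing transition state, similar to array_agg */
typedef struct
{
    size_t   nelems;
    int32_t *elems;     /* grows as more values are aggregated */
} ArrayState;

/* dump_data: flatten the state into a self-contained byte buffer,
 * returning the number of bytes written */
static size_t
dump_data(const ArrayState *state, char *buf)
{
    memcpy(buf, &state->nelems, sizeof(size_t));
    memcpy(buf + sizeof(size_t), state->elems,
           state->nelems * sizeof(int32_t));
    return sizeof(size_t) + state->nelems * sizeof(int32_t);
}

/* read_data: reconstruct the state from the flattened form */
static void
read_data(ArrayState *state, const char *buf)
{
    memcpy(&state->nelems, buf, sizeof(size_t));
    state->elems = malloc(state->nelems * sizeof(int32_t));
    memcpy(state->elems, buf + sizeof(size_t),
           state->nelems * sizeof(int32_t));
}
```

The point is just that the state becomes something we can write to a
batch file and later resume, instead of an opaque pointer.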

That being said, we can't require this from all existing aggregates.
There'll always be aggregates not providing it (for example some old
third-party aggregates).

So even if we have this, we'll have to support the case when it's not
provided - possibly by using the batching algorithm you provided. What
I imagine is this:

   hitting work_mem limit -> do we know how to dump the aggregate state?

     yes (known type or serialize/deserialize)
         => use the batching algorithm from hash join

     no (unknown type, ...)
         => use the batching algorithm described in the original message
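In code, that dispatch might look roughly like this (a standalone
sketch, all names are mine and not actual executor code):

```c
#include <stdbool.h>

/* hypothetical: pick a spill strategy when the hash table hits work_mem */
typedef enum
{
    SPILL_HASHJOIN_STYLE,   /* dump states + tuples to batch files */
    SPILL_FREEZE_GROUPS     /* keep current groups, batch new tuples */
} SpillStrategy;

static SpillStrategy
choose_spill_strategy(bool has_known_state_type, bool has_serialize_funcs)
{
    /* we can dump a transition state only if we know how to write it out */
    if (has_known_state_type || has_serialize_funcs)
        return SPILL_HASHJOIN_STYLE;
    return SPILL_FREEZE_GROUPS;
}
```

So aggregates providing serialize/deserialize would take the hashjoin
path, everything else falls back to your algorithm.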

Now, I'm not trying to make you implement all this - I'm willing to work
on that. Implementing this CREATE AGGREGATE extension is, however,
tightly coupled with your patch, because that's the only place where
it might be used (that I'm aware of).

> I'm open to ideas. Do you think my patch is going generally in the right
> direction, and we can address this problem later; or do you think we
> need a different approach entirely?

I certainly think having memory-bounded hashagg is a great improvement,
and yes - this patch can get us there. Maybe it won't get us all the way
to the "perfect solution" but so what? We can improve that by further
patches (and I'm certainly willing to spend some time on that).

So thanks a lot for working on this!

>> While hacking on the hash join, I envisioned the hash aggregate might
>> work in a very similar manner, i.e. something like this:
>>   * nbatches=1, nbits=0
>>   * when work_mem gets full => nbatches *= 2, nbits += 1
>>   * get rid of half the groups, using nbits from the hash
>>      => dump the current states into 'states.batchno' file
>>      => dump further tuples to 'tuples.batchno' file
>>   * continue until the end, or until work_mem gets full again
> It would get a little messy with HashAgg. Hashjoin is dealing entirely
> with tuples; HashAgg deals with tuples and groups.

I don't see why it should get messy? In the end, you have a chunk of
data and a hash for it.
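The batch assignment is the same trick hashjoin uses - the batch number
is just the low nbits of the hash. A standalone sketch (the helper name
is mine):

```c
#include <stdint.h>

/* batch number = low nbits of the hash, with nbatches = 1 << nbits */
static uint32_t
batch_of(uint32_t hash, uint32_t nbatches)
{
    return hash & (nbatches - 1);
}
```

The nice property is that when nbatches doubles we only look at one more
hash bit, so a group either keeps its batch number or moves to
(batchno + old_nbatches) - never backwards into an already-processed
batch. That's exactly what makes the incremental doubling safe.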

> Also, if the transition state is fixed-size (or even nearly so), it
> makes no sense to remove groups from the hash table before they are
> finished. We'd need to detect that somehow, and it seems almost like two
> different algorithms (though maybe not a bad idea to use a different
> algorithm for things like array_agg).

It just means you need to walk through the hash table, look at the
hashes and dump ~50% of the groups to a file. I'm not sure how difficult
that is with dynahash, though (hashjoin uses a custom hash table, which
makes this very simple).
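A sketch of the idea (standalone, my names; it only counts the groups
that would be dumped instead of actually writing them out):

```c
#include <stddef.h>
#include <stdint.h>

/* After doubling nbatches, the newly added hash bit (whose value equals
 * old_nbatches, a power of two) decides which groups leave the current
 * batch - on average about half of them. */
static size_t
count_groups_to_dump(const uint32_t *hashes, size_t ngroups,
                     uint32_t old_nbatches)
{
    size_t ndump = 0;

    for (size_t i = 0; i < ngroups; i++)
        if (hashes[i] & old_nbatches)   /* the newly added hash bit */
            ndump++;
    return ndump;
}
```

In the real thing the loop would walk the hash table entries and write
each selected group's state to its batch file, then free it.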

> Not saying that it can't be done, but (unless you have an idea) requires
> quite a bit more work than what I did here.
>> It also seems to me that the logic of the patch is about this:
>>   * try to lookup the group in the hash table
>>     * found => call the transition function
>>     * not found
>>         * enough space => call transition function
>>         * not enough space => tuple/group goes to a batch
>> Which pretty much means all tuples need to do the lookup first. The nice
>> thing on the hash-join approach is that you don't really need to do the
>> lookup - you just need to peek at the hash whether the group belongs to
>> the current batch (and if not, to which batch it should go).
> That's an interesting point. I suspect that, in practice, the cost of
> hashing the tuple is more expensive (or at least not much cheaper than)
> doing a failed lookup.

I think you're missing the point here. You need to compute the hash in
both cases. And then you either can do a lookup or just peek at the first
few bits of the hash to see whether it's in the current batch or not.

Certainly, doing this:

    batchno = hash & (nbatches - 1);

    if (batchno > curbatch) {
        /* ... not current batch, dump to file ... */
    }

is much faster than a lookup. Also, as the hash table grows (beyond L3
cache size, which is a few MBs today), it becomes much slower in my
experience - that's one of the lessons I learnt while hacking on the
hashjoin. And we're dealing with hashagg not fitting into work_mem, so
this seems to be relevant.

>> For aggregates using 'internal' to pass pointers that requires some help
>> from the author - serialization/deserialization functions.
> Ah, yes, this is what I was referring to earlier.
>> >    * EXPLAIN details for disk usage
>> >    * choose number of partitions intelligently
>> What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
>> we need 2048 partitions, why should we use less if we believe it will
>> get us over work_mem?
> Because I suspect there are costs in having an extra file around that
> I'm not accounting for directly. We are implicitly assuming that the OS
> will keep around enough buffers for each BufFile to do sequential writes
> when needed. If we create a zillion partitions, then either we end up
> with random I/O or we push the memory burden into the OS buffer cache.

Assuming I understand it correctly, I think this logic is broken. Are you
saying "We'll try to do memory-bounded hashagg, but not for the really
large datasets because of fear we might cause random I/O"?

While I certainly understand your concerns about generating an excessive
amount of random I/O, I think modern filesystems handle that just fine
(coalescing the writes into mostly sequential writes, etc.). Also,
current hardware is really good at handling this (controllers with write
cache, SSDs etc.).

Also, if hashjoin does not worry about the number of batches, why should
hashagg worry about that? I expect the I/O patterns to be very similar.

And if you have many batches, it means you have tiny work_mem, compared
to the amount of data. Either you have unreasonably small work_mem
(better fix that) or a lot of data (better have a lot of RAM and good
storage, or you'll suffer anyway).

In any case, trying to fix this by limiting number of partitions seems
like a bad approach. I think factoring those concerns into a costing
model is more appropriate.

> We could try to model those costs explicitly to put some downward
> pressure on the number of partitions we select, but I just chose to cap
> it for now.

OK, understood. We can't get all the goodies in the first version.

>> For us, removing the sort is a big deal, because we're working with
>> >100M rows regularly. It's more complicated though, because the sort is
>> usually enforced by COUNT(DISTINCT) and that's not going to disappear
>> because of this patch. But that's solvable with a custom aggregate.
> I hope this offers you a good alternative.
> I'm not sure it will ever beat sort for very high cardinality cases, but
> I hope it can beat sort when the group size averages something higher
> than one. It will also be safer, so the optimizer can be more aggressive
> about choosing HashAgg.

It's certainly an improvement, although the sort may still win for one
of two reasons:

   (a) COUNT(DISTINCT) -> this is solved by a custom aggregate

   (b) bad estimate of required memory -> this is common for aggregates
       passing 'internal' state (planner uses some quite high defaults)


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)