On 06/05/2018 04:56 AM, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund <and...@anarazel.de> wrote:
>> That part has gotten a bit easier since, because we have serialize /
>> deserialize operations for aggregates these days.
> 
> True. Although not all built in aggregates have those defined.

Not sure what to do about those, unfortunately. We could stop adding
more groups and hope the aggregate state does not grow further, but
that's about it.

> 
>> I wonder whether, at least for aggregates, the better fix wouldn't be to
>> switch to feeding the tuples into tuplesort upon memory exhaustion and
>> doing a sort based aggregate.  We have most of the infrastructure to do
>> that due to grouping sets. It's just the pre-existing in-memory tuples
>> that'd be problematic, in that the current transition values would need
>> to be serialized as well.  But with a stable sort that'd not be
>> particularly problematic, and that could easily be achieved.

That's one of the possible solutions, yes. It's hard to say if it's
better or worse than the other approaches, because that depends on the
number of tuples vs. number of groups etc.
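
Just to make the sort-based fallback a bit more concrete, here is a toy
sketch (plain C, nothing here is executor code; the struct and the
tie-break flag are invented for the example). The serialized transition
states go into the same sort as the remaining raw tuples, keyed on the
group key, with the partial states ordered first within each group, so
a single pass can combine everything:

/*
 * Toy illustration of falling back to sort-based aggregation.  The
 * "aggregate" is a plain sum; in reality the partial entries would be
 * serialized transition states merged via the combine function.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct sort_item
{
    int     key;            /* group key */
    int     is_partial;     /* 1 = serialized transition state */
    long    value;          /* raw value, or partial sum */
} sort_item;

static int
cmp_items(const void *a, const void *b)
{
    const sort_item *x = a;
    const sort_item *y = b;

    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    /* partial states sort first within each group */
    return y->is_partial - x->is_partial;
}

int
main(void)
{
    sort_item   items[] = {
        {2, 1, 10},                                     /* state from the hash table */
        {1, 0, 1}, {2, 0, 1}, {1, 0, 1}, {3, 0, 1},     /* remaining raw tuples */
    };
    int         n = sizeof(items) / sizeof(items[0]);

    qsort(items, n, sizeof(sort_item), cmp_items);

    /* one pass, combining partial states and raw values per group */
    for (int i = 0; i < n;)
    {
        int     key = items[i].key;
        long    sum = 0;

        for (; i < n && items[i].key == key; i++)
            sum += items[i].value;
        printf("group %d: sum = %ld\n", key, sum);
    }
    return 0;
}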

Evicting some (or all) of the in-memory groups can easily be better or
worse, depending on the data set. And I'm not sure the code complexity
would be significantly different.

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one piece.
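
To illustrate what I mean by eviction, here is a rough toy sketch (not
PostgreSQL code; the names, the fixed "work_mem" check and the
sum/count aggregate are all made up). Once the hash table exceeds the
memory limit, whole groups get serialized to a temp file and dropped
from memory:

/*
 * Toy sketch of a memory-bounded hash aggregate with group eviction.
 */
#include <stdio.h>
#include <stdlib.h>

#define MEM_LIMIT   (4 * 1024)      /* pretend work_mem, in bytes */
#define NBUCKETS    64

typedef struct group
{
    int             key;
    long            sum;            /* transition state */
    long            count;
    struct group   *next;
} group;

static group  *buckets[NBUCKETS];
static size_t  mem_used = 0;
static FILE   *spill_file = NULL;

/* serialize one group's transition state to a temp file and drop it */
static void
spill_group(group *g)
{
    if (!spill_file)
        spill_file = tmpfile();
    fwrite(&g->key, sizeof(g->key), 1, spill_file);
    fwrite(&g->sum, sizeof(g->sum), 1, spill_file);
    fwrite(&g->count, sizeof(g->count), 1, spill_file);

    mem_used -= sizeof(group);
    free(g);
}

/* evict the first group we find (a real policy would be smarter) */
static void
evict_one_group(void)
{
    for (int i = 0; i < NBUCKETS; i++)
    {
        if (buckets[i])
        {
            group  *g = buckets[i];

            buckets[i] = g->next;
            spill_group(g);
            return;
        }
    }
}

/* feed one input tuple (key, value) into the aggregate */
static void
advance_aggregate(int key, long value)
{
    int     b = (unsigned int) key % NBUCKETS;
    group  *g;

    for (g = buckets[b]; g; g = g->next)
        if (g->key == key)
            break;

    if (!g)
    {
        /* make room before creating a new group */
        while (mem_used + sizeof(group) > MEM_LIMIT)
            evict_one_group();

        g = calloc(1, sizeof(group));
        g->key = key;
        g->next = buckets[b];
        buckets[b] = g;
        mem_used += sizeof(group);
    }

    g->sum += value;
    g->count++;
}

int
main(void)
{
    /* more groups than fit into MEM_LIMIT */
    for (int i = 0; i < 100000; i++)
        advance_aggregate(rand() % 10000, 1);

    /*
     * A real implementation would now merge the partial states from
     * spill_file with what's still in memory, which is exactly where
     * the serialize/deserialize (and combine) support matters.
     */
    printf("in-memory state: %zu bytes\n", mem_used);
    return 0;
}

The interesting design questions are of course hidden in
evict_one_group() -- which groups to pick, how many, and what statistics
to keep to make that decision.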

While the primary goal of the patch is addressing the OOM risks in hash
aggregate, I think it'd be a mistake to see it only that way. It could
also allow us to use hash aggregate more often, even when we know the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We'd pay the price for eviction, but on large
data sets that can be massively cheaper than having to sort the whole
input.

I admit this is a bit hand-wavy, and the costing depends on us being
able to estimate the amount of evicted data. I certainly don't expect to
get this right away (that would move the goal posts for the patch very
far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.
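
To give a back-of-the-envelope example (the constants and formulas
below are entirely made up, this is not the planner's cost model), the
point is simply that spilling a fraction of the aggregate state can
still beat sorting the whole input:

/*
 * Entirely made-up cost model, for illustration only.  hashagg pays one
 * pass over the input plus write+read of the evicted aggregate state;
 * groupagg pays an n*log(n) sort of the whole input plus one pass.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  ntuples = 100e6;        /* input rows */
    double  ngroups = 10e6;         /* estimated number of groups */
    double  spill_fraction = 0.5;   /* share of groups we expect to evict */
    double  cpu_tuple = 0.01;       /* invented cost units */
    double  io_tuple = 0.1;

    double  hashagg = ntuples * cpu_tuple
                    + ngroups * spill_fraction * 2 * io_tuple;

    double  groupagg = ntuples * cpu_tuple * log2(ntuples)
                     + ntuples * cpu_tuple;

    printf("hashagg  ~ %.0f\ngroupagg ~ %.0f\n", hashagg, groupagg);
    return 0;
}

The real costing would also have to account for groups being spilled
repeatedly, which is exactly why the estimate of the amount of evicted
data matters.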

> 
> Isn't there still a problem determining when the memory exhaustion
> actually happens though? As far as I know, we still have little
> knowledge of how much memory each aggregate state occupies.
> 
> Jeff tried to solve this in [1], but from what I remember, there was
> too much concern about the overhead of the additional accounting code.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com
> 

I had a chat with Jeff Davis at PGCon about this, and IIRC he suggested
that a couple of the people who were originally worried about the
overhead now seem to be accepting it.

IMHO we do want a memory-bound hash aggregate, and doing some sort of
memory accounting is a must-have part of that. I don't see a way around
this, really. We need to minimize the overhead, of course.

I do not recall the exact approach we ended up with in 2015, though, or
what the measurements with that version were. I see 1-3% overhead
mentioned early in the thread, and I think there were some additional
improvements in subsequent patch versions.

I don't think we can realistically improve on this (accounting at the
block level), and there was some discussion about whether the measured
difference is actually overhead or merely an effect of a different
binary layout.
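
For reference, the block-level idea is roughly this (hypothetical names,
not the actual memory context code): the counter is updated only when a
whole block is allocated or freed, so the per-chunk palloc/pfree hot
path stays untouched.

/*
 * Sketch of block-level memory accounting, with invented names.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct accounting_context
{
    size_t  total_block_bytes;      /* what hashagg would compare to work_mem */
} accounting_context;

static void *
context_alloc_block(accounting_context *ctx, size_t blksize)
{
    void   *block = malloc(blksize);

    if (block)
        ctx->total_block_bytes += blksize;  /* accounted once per block */
    return block;
}

static void
context_free_block(accounting_context *ctx, void *block, size_t blksize)
{
    free(block);
    ctx->total_block_bytes -= blksize;
}

int
main(void)
{
    accounting_context ctx = {0};

    /* individual chunks would be carved out of these blocks without
     * touching the counter at all */
    void   *a = context_alloc_block(&ctx, 8192);
    void   *b = context_alloc_block(&ctx, 16384);

    printf("accounted: %zu bytes\n", ctx.total_block_bytes);

    context_free_block(&ctx, a, 8192);
    context_free_block(&ctx, b, 16384);
    printf("accounted: %zu bytes\n", ctx.total_block_bytes);
    return 0;
}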

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
