Parallel Aggregates for string_agg and array_agg

David Rowley Sun, 17 Dec 2017 06:31:55 -0800

Hi,

While working on partial aggregation a few years ago, I didn't really
think it was worthwhile allowing partial aggregation of string_agg and
array_agg. I soon realised that I was wrong about that and allowing
parallelisation of these aggregates still could be very useful when
many rows are filtered out during the scan.


Some benchmarks that I've done locally show that parallel string_agg
and array_agg do actually perform better, despite the fact that the
aggregate state grows linearly with each aggregated item. Obviously,
the performance will get even better when workers are filtering out
rows before aggregation takes place, so it seems worthwhile doing
this. However, the main reason that I'm motivated to do this is that
there are more uses for partial aggregation other than just parallel
aggregation, and it seems a shame to disable all these features if a
single aggregate does not support partial mode.

I've attached a patch which implements all this. I've had most of it
stashed away for a while now, but I managed to get some time this
weekend to get it into a more completed state.

Things are now looking pretty good for the number of aggregates that
support partial mode.

Just a handful of aggregates now don't support partial aggregation;

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and
aggkind='n';
     aggfnoid
------------------
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it;

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and
aggkind='n';
 count
-------
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done
either, it might just require shifting a bit more work into the final
functions, although, I'm not planning on that for this patch.

As for the patch; there's a bit of a quirk in the implementation of
string_agg. We previously always threw away the delimiter that belongs
to the first aggregated value, but we do now need to keep that around
so we can put it in between two states in the combine function. I
decided the path of least resistance to do this was just to borrow
StringInfo's cursor variable to use as a pointer to the state of the
first value and put the first delimiter before that. Both the
string_agg(text) and string_agg(bytea) already have a final function,
so we just need to skip over the bytes up until the cursor position to
get rid of the first delimiter. I could go and invent some new state
type to do the same, but I don't really see the trouble with what I've
done with StringInfo, but I'll certainly listen if someone else thinks
this is wrong.

Another thing that I might review later about this is seeing about
getting rid of some of the code duplication between
array_agg_array_combine and accumArrayResultArr.

I'm going to add this to PG11's final commitfest rather than the
January 'fest as it seems more like a final commitfest type of patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

combinefn_for_string_and_array_aggs_v1.patch
Description: Binary data

Parallel Aggregates for string_agg and array_agg

Reply via email to