Re: Parallel Aggregates for string_agg and array_agg

David Rowley Mon, 26 Mar 2018 15:05:58 -0700

On 27 March 2018 at 09:27, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all.  I quote from the fine
> manual:
>
>     The aggregate functions array_agg, json_agg, jsonb_agg,
>     json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
>     similar user-defined aggregate functions, produce meaningfully
>     different result values depending on the order of the input
>     values. This ordering is unspecified by default, but can be controlled
>     by writing an ORDER BY clause within the aggregate call, as shown in
>     Section 4.2.7. Alternatively, supplying the input values from a sorted
>     subquery will usually work ...
>
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today.  Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.


This was not in my list of reasons for not adding them the first time
around. I mentioned these reasons in a response to Stephen.

> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries.  Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.

This very much reminds me of something that exists in the 8.4 release notes:

> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted 
> output (Tom)

> Previously, these types of queries always removed duplicate rows by means of 
> Sort/Unique processing (i.e., sort then remove adjacent duplicates). Now they 
> can be implemented by hashing, which will not produce sorted output. If an 
> application relied on the output being in sorted order, the recommended fix 
> is to add an ORDER BY clause. As a short-term workaround, the previous 
> behavior can be restored by disabling enable_hashagg, but that is a very 
> performance-expensive fix. SELECT DISTINCT ON never uses hashing, however, so 
> its behavior is unchanged.

Seems we were happy enough then to tell users to add an ORDER BY.

However, this case is different. Back in 8.4, the results had always
been ordered before the change. This time they're only possibly ordered
to begin with. So we'll probably surprise fewer people this time around.
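
To make the ordering concern concrete, here is a toy Python model of
partial aggregation. It is only a sketch under stated assumptions, not
how PostgreSQL implements anything: the function names, the chunking of
rows across workers, and the "arrival order" parameter are all
hypothetical. It shows that combining per-worker partial states in
whatever order they arrive can yield different strings, while sorting
the input first (standing in for ORDER BY inside the aggregate call,
which is run serially here) yields exactly one.

```python
from itertools import permutations


def partial_agg(chunk):
    """Hypothetical per-worker partial state: values in the order seen."""
    return list(chunk)


def combine(states):
    """Hypothetical combine step: concatenate partial states as they arrive."""
    merged = []
    for state in states:
        merged.extend(state)
    return merged


def parallel_string_agg(chunks, arrival_order, sep=","):
    """Toy parallel string_agg: the result depends on worker arrival order."""
    states = [partial_agg(chunks[i]) for i in arrival_order]
    return sep.join(combine(states))


def ordered_string_agg(rows, sep=","):
    """Toy serial string_agg with ORDER BY: sort the input, then aggregate."""
    return sep.join(sorted(rows))


# Assumed split of five rows across three workers.
chunks = [["a", "b"], ["c", "d"], ["e"]]
all_rows = [value for chunk in chunks for value in chunk]

# Every possible order in which the three partial states could be combined.
unordered_results = {
    parallel_string_agg(chunks, order)
    for order in permutations(range(len(chunks)))
}
ordered_result = ordered_string_agg(all_rows)

print(sorted(unordered_results))
print(ordered_result)
```

In this model the unordered variant can produce any of six strings,
one of which happens to match the ordered result, which is the
"possibly ordered" situation described above.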

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
