[HACKERS] Review: listagg aggregate

David E . Wheeler Fri, 22 Jan 2010 11:14:41 -0800

Pavel,

My review of your listagg patch.


Submission Review
-----------------
* The diff is a context diff and applies cleanly to HEAD (with just two hunks 
offset by 2 lines each).

* There is documentation, though I'm not sure it needs to be mentioned in the 
string functions documentation. No harm in it, I guess.

  I would like to see an example, though, and the documentation does not 
currently explain what each of the parameters are for. In fact, it looks like 
all the existing aggregates take only one parameter, so there was not 
previously a need to explain it. But listagg() has an optional second param. I 
think that the description should explain what it's for.

* There are tests and they look fine.

Usability Review
----------------
* The patch does in fact implement the aggregate function it describes, and OH 
YES do we want it (I've written my own in SQL a few times).

* No, we don't already have it.

* Yes it follows community-agreed behavior. I'm assuming that there is no 
special parsing of aggregate functions, so the simple use of commas to separate 
the two parameters is appropriate, rather than using a keyword like MySQL's 
SEPARATOR in the group_concat() aggregate.

* No need to have pg_dump support, no dangers that I can see, looks like all 
the bases have been covered.

Feature Test
------------
* Everything built cleanly, but I got an OID dupe error when I tried to init 
the DB. Looks like 2997 and 2998 have been used for something else since you 
created the patch. I changed them to 2995 and 2996 and then it worked.
* The feature appears to work. I didn't see any tests for encodings or other 
data types, so I ran a few myself and they work fine:

postgres=# select listagg(a, U&'-\0441\043B\043E\043D-') from 
(values('aaaa'),('bbbb'),('cccc'
         listagg          
--------------------------
 aaaa-слон-bbbb-слон-cccc
(1 row)

postgres=# select listagg(a, U&'\2014') from 
(values(U&'\0441\043B\043E\043D'),(U&'d\0061t\+000061'),(U&'\0441\043B\043E\043D'))
 AS g(a);
    listagg     
----------------
 слон—data—слон
(1 row)


postgres=# select listagg(a::text) from (values(1),(2),(3)) AS g(a);
 listagg 
---------
 123
(1 row)


Performance Review
------------------

No performance issues, except that it should be faster than a custom aggregate 
that does the same thing. To test, I created a quick custom aggregate (no 
second argument, alas, so listagg() is more flexible) like so:

    CREATE OR REPLACE FUNCTION a2s(ANYARRAY)
    RETURNS TEXT LANGUAGE SQL AS $$
        SELECT array_to_string($1, ',');
    $$;

   CREATE AGGREGATE string_accum (
        SFUNC    = array_accum,
        BASETYPE = ANYELEMENT,
        STYPE    = ANYARRAY,
        INITCOND = '{}',
        FINALFUNC = a2s 
    );

Then I ran some simple tests (thanks for the clue, depesz):

postgres=# select count(*) from (select string_accum(a) from 
(values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000) i) AS x(i);
 count 
-------
     1
(1 row)

Time: 1365.382 ms

postgres=# select count(*) from (select listagg(a) from 
(values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000) i) AS x(i);
 count 
-------
     1
(1 row)

Time: 17.989 ms

So overall, it looks like listagg() is 1-2 orders of magnitude faster. YMMV, 
and my system is built with --enable-cassert and --enable-debug. Still, good 
job.

Coding Review
-------------

* Is varchar.c really the best place to put the ListAggState struct and the 
listagg() function? I grepped the source for array_agg() and it's in 
src/backend/utils/adt/array_userfuncs.c. Maybe there's an equivalent file for 
string functions? Otherwise, the style of the C code looks fine to my untrained 
eye.

  Actually, shouldn't it return text rather than varchar?

* Does it really require four functions to do its work? Might there be some way 
to use the array_agg() C functions and then just a different final function to 
turn it into a string (using the internal array_to_string() function, perhaps)? 
I'm not at all sure about it, but given how little code was required to create 
the same basic functionality in SQL, I'm surprised that the C implementation 
requires four functions (accumStringResult(), listagg1_transfn(), 
listagg2_transfn(), and listagg_finalfn()). Maybe they're required to make it 
fast and avoid the overhead of an array?

* No compiler warnings, I never made it crash, good comments, does what it says 
on the tin. I doubt that there are any portability issues, as the code seems to 
use standard PostgreSQL internal macros and functions.

Architecture Review
-------------------

* No dependencies, things seem to make sense overall, notwithstanding my 
questions in the Coding Review.

Review Review
-------------

The only thing I haven't covered so far is the name. I agree with Tom's 
assertion that the name is awful. Sure there may be a precedent in Oracle, but 
I hardly find that convincing (some of the big corporations seem to do a really 
shitty job naming things in their APIs). Given that we already have 
array_agg(), what about simply concat_agg(), as Tom (wincingly, I grant you) 
suggested? Or string_agg()?

My bike shed is puce.

Attached is a new patch with the changed OIDs and an added phrase in the 
documentation that indicates that the second argument can be used to separate 
the concatenated values.

Best,

David

listagg.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Review: listagg aggregate

Reply via email to