Re: [HACKERS] WITHIN GROUP patch

Tom Lane Thu, 05 Dec 2013 17:50:09 -0800

Further questions about WITHIN GROUP:

I believe that the spec requires that the "direct" arguments of an inverse
or hypothetical-set aggregate must not contain any Vars of the current
query level.  They don't manage to say that in plain English, of course,
but in the <hypothetical set function> case the behavior is defined in
terms of this sub-select:


        ( SELECT 0, SK1, ..., SKK
          FROM TNAME UNION ALL
          VALUES (1, VE1, ..., VEK) )

where SKn are the sort key values, TNAME is an alias for the input table,
and VEn are the direct arguments.  Any input-table Vars the aggregate
might refer to are thus in scope only for the SKn not the VEn.  (This
definition also makes it clear that the VEn are to be evaluated once,
not once per row.)  In the <inverse distribution function> case, they
resort to claiming that the value of the <inverse distribution function
argument> can be replaced by a numeric literal, which again makes it clear
that it's supposed to be evaluated just once.

So that's all fine, but the patch seems a bit confused about it.  If you
try to cheat, you get an error message that's less than apropos:

regression=# select cume_dist(f1) within group(order by f1) from text_tbl ;
ERROR:  column "text_tbl.f1" must appear in the GROUP BY clause or be used in 
an aggregate function
LINE 1: select cume_dist(f1) within group(order by f1) from text_tbl...
                         ^

I'd have hoped for a message along the line of "fixed arguments of
cume_dist() must not contain variables", similar to the phrasing we
use when you try to put a variable in LIMIT.

Also, there are some comments and code changes in nodeAgg.c that seem
to be on the wrong page here:
  
+       /*
+        * Use the representative input tuple for any references to
+        * non-aggregated input columns in the qual and tlist.    (If we are not
+        * grouping, and there are no input rows at all, we will come here
+        * with an empty firstSlot ... but if not grouping, there can't be any
+        * references to non-aggregated input columns, so no problem.)
+        * We do this before finalizing because for ordered set functions,
+        * finalize_aggregates can evaluate arguments referencing the tuple.
+        */
+       econtext->ecxt_outertuple = firstSlot;
+ 

The last two lines of that comment are new, all the rest was moved from
someplace else.  Those last two lines are wrong according to the above
reasoning, and if they were correct the argument made in the pre-existing
part of the comment would be all wet, meaning the code could in fact crash
when trying to evaluate a Var reference in finalize_aggregates.

So I'm inclined to undo this part of the patch (and anything else that's
rearranging logic with an eye to Var references in finalize_aggregates),
and to try to fix parse_agg.c so it gives a more specific error message
for this case.

After looking at this I'm a bit less enthused about the notion of hybrid
aggregates than I was before.  I now see that the spec intends that when
the WITHIN GROUP notation is used, *all* the arguments to the left of
WITHIN GROUP are meant to be fixed evaluate-once arguments.  While we
could possibly define aggregates for which some of those argument
positions are treated as evaluate-once-per-row arguments, I'm realizing
that that'd likely be pretty darn confusing for users.

What I now think we should do about the added pg_aggregate columns is
to have:

aggnfixedargs   int     number of fixed arguments, excluding any
                        hypothetical ones (always 0 for normal aggregates)
aggkind         char    'n' for normal aggregate, 'o' for ordered set function,
                        'h' for hypothetical-set function

with the parsing rules that given

     agg( n fixed arguments ) WITHIN GROUP ( ORDER BY k sort specifications )

1. WITHIN GROUP is disallowed for normal aggregates.

2. For an ordered set function, n must equal aggnfixedargs.  We treat all
n fixed arguments as contributing to the aggregate's result collation,
but ignore the sort arguments.

3. For a hypothetical-set function, n must equal aggnfixedargs plus k,
and we match up type and collation info of the last k fixed arguments
with the corresponding sort arguments.  The first n-k fixed arguments
contribute to the aggregate's result collation, the rest not.

Reading back over this email, I see I've gone back and forth between
using the terms "direct args" and "fixed args" for the evaluate-once
stuff to the left of WITHIN GROUP.  I guess I'm not really sold on
either terminology, but we need something we can use consistently
in the code and docs.  The spec is no help, it has no generic term
at all for these args.  Does anybody else have a preference, or
maybe another suggestion entirely?

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WITHIN GROUP patch

Reply via email to