On 28-9-2018 18:34, Adriano dos Santos Fernandes wrote:
Hi!

I propose creation of user-defined aggregate functions.

Lets initially define the requirements for it:
- It should work with empty data set too, in case of use without GROUP
BY clause: i.e., COUNT results in 0, while most others functions result
in NULL
- It should also produce "running (or accumulated) aggregation", like
SUM(...) OVER (ORDER BY ...)

What is not proposed here:
- Make a single function work with different source/target data types.
Custom aggregate function is statically typed as normal functions.

Design is open if good alternatives are presented. Initially I think:

- There should be one input parameter and one output parameter.
- SUSPEND is used to make engine feed the input parameter with a new
value, setting AGG_FINISHED to TRUE/FALSE, and returning running
(accumulated) aggregation value.
- Empty data set makes the function already starts with AGG_FINISHED =
TRUE, and non empty data set makes the function starts with AGG_FINISHED
= FALSE and the first row already fed in the input parameter
- When AGG_FINISHED is TRUE, it's expected that user code does not call
SUSPEND anymore, so only for the empty data set case, the value present
in the output parameter when the function ends is the result of the
aggregate function in the query.

I think the syntax below is rather verbose, and I wonder if it can easily be used to support all types of aggregates.

It introduces some additional behavior and syntax that hasn't surfaced before in PSQL, like:

- a pass by reference call mechanism (the input parameter) to pump new values in, this may make it harder to use UDRs for this (not all languages (easily) support such call mechanisms) - a named output parameter for a function, which breaks syntax consistency with normal stored functions - an externally controlled context value (agg_finished), again something that I don't can be easily supported in UDR

Although I haven't used it, I think something like PostgreSQL's aggregate (see https://www.postgresql.org/docs/10/static/xaggr.html and https://www.postgresql.org/docs/10/static/sql-createaggregate.html) may be simpler, but that may require the introduction of some sort of record type (unless arrays are 'good enough') to be able to maintain and pass more complex state. There are also some other 'bears' their as it relies on PostgreSQLs object support.

We could also consider to have some aggregate object that has state and contains the functions that the engine calls for calculating the aggregate. In a way, this aggregate would implement an interface required by the engine.

That interface consists of a procedure state_transition to accept the value(s) of the aggregate, and modifies the internal state, and a function final_state that returns the aggregate value.

The engine creates an instance of the aggregate, and calls the state_transition to add values, and calls the final_state whenever it needs the final result.

The sum example

create aggregate custom_sum(i bigint) returns bigint
as
   declare summed bigint = 0;
begin
   procedure state_transition(i bigint)
   as
   begin
     if (is is null) then leave;
     summed = summed + i;
   end

   function final_state returns bigint
   as
     return summed;
   end
end;

The average example. Be aware that this implementation (like yours) will have a precision problem for a large number of small values, and may be problematic for large values.

create aggregate custom_avg(i double precision) returns double precision
as
  declare current_count integer = 0;
  declare accumulated double precision = 0;
begin
  procedure state_transition(i double precision)
  as
  begin
    if (is is null) then leave;
    current_count = current_count + 1;
    accumulated = accumulated + i;
  end

  function final_state returns double precision
    return accumulated / current_count;
  end
end

Your final example (with a change to not count null):

create aggregate custom_count_plus_1000 (i integer) returns integer
as
  declare current_count integer;
begin
  procedure state_transition(i integer)
  as
  begin
    if (current_count is null) then
      current_count = 0;
    if (is is null) then leave;
    current_count = current_count + 1;
  end

  function final_state returns integer
  as
  begin
    if (current_count is null) then
       return 1000;
    else
       return current_count;
  end
end

This may be extended to also allow support for moving aggregates with window functions (see "Moving-Aggregate Mode" in the PostgreSQL docs) by having additional procedures and functions.

However it looks like the way PostgreSQL passes state between functions may actually be better, as it would probably be easier for supporting grouping sets and other aggregations happening on multiple levels.

Taking the avg example again, such a form could look like:

create aggregate custom_avg(i double precision) returns double precision
as
begin
record_type custom_avg_record (current_count integer, accumulated double precision)

  function initial_state returns custom_avg_record
  as
  begin
    return custom_avg_record(0, 0);
  end

function state_transition(custom_avg_record previous_state, i double precision) returns custom_avg_record
  as
  begin
    if (is is null) then
      return previous_state;

    return custom_avg_record(
        previous_state.current_count + 1,
        previous_state.accumulated + i);
  end

function final_state(custom_avg_record previous_state) returns double precision
    return previous_state.accumulated / previous_state.current_count;
  end
end

Record type syntax etc just made up, would need more fleshing out (and possibly aligned with supporting row values).

Below I put some example (tested mentally only) functions:


------------------
-- Works as standard SUM.
create aggregate function custom_sum (i integer) returns (o integer)
as
begin
     while (not agg_finished)
     do
     begin
         if (i is not null) then
         begin
             if (o is null) then
                 o = 0;

             o = o + i;
         end

         suspend;
     end
end


-- Works as standard AVG.
create aggregate function custom_avg (i double precision) returns (o
double precision)
as
     declare count integer = 0;
     declare accumulated double precision = 0;
begin
     while (not agg_finished)
     do
     begin
         if (i is not null) then
         begin
             count = count + 1;
             accumulated = accumulated + i;
             o = accumulated / count;
         end

         suspend;
     end
end


-- Works as standard COUNT.
  create aggregate function custom_count (i integer) returns (o integer)
as
begin
     o = 0;

     while (not agg_finished)
     do
     begin
         if (i is not null) then
             o = o + 1;
         suspend;
     end
end


-- This function shows the difference of returning value in SUSPEND when
data set is not empty and returning in function termination when data
set is empty.
-- select custom_count_plus_1000(1) from rdb$database -- returns 1
-- select custom_count_plus_1000(1) from rdb$database where 1 = 0 --
returns 1000
create aggregate function custom_count_plus_1000 (i integer) returns (o
integer)
as
begin
     o = 0;

     while (not agg_finished)
     do
     begin
         o = o + 1;
         suspend;
     end

     o = o + 1000;
end

--
Mark Rotteveel


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel

Reply via email to