Nikhil Kak created MADLIB-1385:
----------------------------------

             Summary: DL: C function for aggregating arrays for input 
preprocessor
                 Key: MADLIB-1385
                 URL: https://issues.apache.org/jira/browse/MADLIB-1385
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Deep Learning
            Reporter: Nikhil Kak
             Fix For: v1.17


As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1334, we 
noticed that writing our own implementation of array_cat gave a 6x 
performance gain. Note that this gain was measured with incomplete code, so 
in the worst case we may see no gain at all. It is still promising enough 
to warrant a story.

See attached branch and 
https://github.com/reductionista/madlib/blob/b1beaa88c3ec3c2ed412d2c20b5a3cad2f31a8ef/methods/array_ops/src/pg_gp/array_ops.c#L2153

Benchmark: compared `agg_array_concat` vs `my_agg_array_concat` on a local 
mac with GPDB 5.21 and 3 segments.

{code}
madlib=# select array_dims(madlib.agg_array_concat(ARRAY[x])) from mnist_train_10k;
    array_dims
------------------
 [1:10000][1:784]
Time: 30949.794 ms

madlib=# select array_dims(madlib.my_agg_array_concat(ARRAY[x], 3380)) from mnist_train_10k;
    array_dims
------------------
 [1:10000][1:784]
Time: 570.079 ms
{code}

We see a 6x performance gain with our C function, but the caveat is that the 
function isn't complete yet. Right now, inside the UDA's transition function, 
we allocate all the memory in advance by passing max(total rows per segment 
across all segments). This will have to be changed so that we can use 
something like repalloc to grow the memory incrementally, which may or may 
not slow down our C function; a sketch of that approach is below. 
Will create a separate story for working on the C function.
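
To make the repalloc idea concrete, here is a minimal sketch of what such a 
transition function could look like. This is an illustration under stated 
assumptions, not the code in the attached branch: the names (concat_state, 
my_agg_array_concat_transfn) are made up, it assumes a DOUBLE PRECISION[] 
input with no NULL elements, and the final function that turns the state 
into a 2-D array is omitted.

{code}
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

typedef struct
{
    float8 *data;   /* flat buffer holding the concatenated elements */
    int64   nused;  /* elements written so far */
    int64   ncap;   /* elements currently allocated */
} concat_state;

PG_FUNCTION_INFO_V1(my_agg_array_concat_transfn);

Datum
my_agg_array_concat_transfn(PG_FUNCTION_ARGS)
{
    MemoryContext aggctx;
    MemoryContext oldctx;
    concat_state *state;
    ArrayType    *in;
    int64         nelems;

    if (!AggCheckCallContext(fcinfo, &aggctx))
        elog(ERROR, "transition function called in non-aggregate context");

    if (PG_ARGISNULL(1))
    {
        /* ignore NULL input rows; pass the existing state through */
        if (PG_ARGISNULL(0))
            PG_RETURN_NULL();
        PG_RETURN_POINTER(PG_GETARG_POINTER(0));
    }

    in = PG_GETARG_ARRAYTYPE_P(1);
    nelems = ArrayGetNItems(ARR_NDIM(in), ARR_DIMS(in));

    /* state must live in the aggregate context, not the per-call one */
    oldctx = MemoryContextSwitchTo(aggctx);
    if (PG_ARGISNULL(0))
    {
        state = (concat_state *) palloc(sizeof(concat_state));
        state->ncap = Max(nelems, 1024);
        state->nused = 0;
        state->data = (float8 *) palloc(state->ncap * sizeof(float8));
    }
    else
        state = (concat_state *) PG_GETARG_POINTER(0);

    /* grow incrementally (doubling) instead of preallocating for
     * max(total rows per segment) up front */
    while (state->nused + nelems > state->ncap)
    {
        state->ncap *= 2;
        state->data = (float8 *) repalloc(state->data,
                                          state->ncap * sizeof(float8));
    }

    /* assumes a float8 array with no nulls bitmap, so the element data
     * is a contiguous run of float8 values */
    memcpy(state->data + state->nused, ARR_DATA_PTR(in),
           nelems * sizeof(float8));
    state->nused += nelems;
    MemoryContextSwitchTo(oldctx);

    PG_RETURN_POINTER(state);
}
{code}

Geometric doubling keeps the amortized cost of the repalloc calls O(1) per 
input array, which is why it may not slow the function down much in practice.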

We can reference the array_agg implementation in PostgreSQL 9.5+, which 
supports taking arrays as input, and also make sure we follow the GPDB 
guidance on aggregates. See:
https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html
https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/array_userfuncs.c#L542
https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/arrayfuncs.c#L5195
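
For reference, the PostgreSQL 9.5+ array_agg(anyarray) transition function 
(first postgres link above) leans on the built-in helpers 
initArrayResultArr/accumArrayResultArr, which handle the buffer growth for 
us. A minimal sketch of that shape, with an illustrative function name:

{code}
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_FUNCTION_INFO_V1(my_array_agg_transfn);

Datum
my_array_agg_transfn(PG_FUNCTION_ARGS)
{
    Oid                 arg1_typeid = get_fn_expr_argtype(fcinfo->flinfo, 1);
    MemoryContext       aggctx;
    ArrayBuildStateArr *state;

    if (!AggCheckCallContext(fcinfo, &aggctx))
        elog(ERROR, "transition function called in non-aggregate context");

    state = PG_ARGISNULL(0) ? NULL
                            : (ArrayBuildStateArr *) PG_GETARG_POINTER(0);

    /* accumArrayResultArr appends one input array onto the accumulated
     * state, allocating in the aggregate context and growing as needed */
    state = accumArrayResultArr(state,
                                PG_GETARG_DATUM(1),
                                PG_ARGISNULL(1),
                                arg1_typeid,
                                aggctx);

    PG_RETURN_POINTER(state);
}
{code}

If these helpers are fast enough on GPDB, we may not need hand-rolled 
buffer management at all.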

Guidance on how to avoid memory leaks when writing your own UDA in Greenplum, 
from https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html:
{code}
Note: If you write a user-defined aggregate in C, and you declare the state
value (stype) as type internal, there is a risk of an out-of-memory error
occurring. If internal state values are not properly managed and a query
acquires too much memory for state values, an out-of-memory error could
occur. To prevent this, use mpool_alloc(mpool, size) to have Greenplum
manage and allocate memory for non-temporary state values, that is, state
values that have a lifespan for the entire aggregation. The argument mpool
of the mpool_alloc() function is aggstate->hhashtable->group_buf. For an
example, see the implementation of the numeric data type aggregates in
src/backend/utils/adt/numeric.c in the Greenplum Database open source code.
{code}
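
Applied to the sketch above, that guidance would mean allocating the 
long-lived state through Greenplum's pool instead of plain palloc. The 
fragment below is schematic and not verified against GPDB internals: only 
mpool_alloc(mpool, size) and aggstate->hhashtable->group_buf come from the 
quoted docs; the MPool type name and the helper are assumptions.

{code}
/* Schematic only: allocate whole-aggregation state via the pool that
 * Greenplum manages, per the docs quoted above. */
static concat_state *
alloc_concat_state(AggState *aggstate)
{
    /* group_buf is the memory pool Greenplum tracks for aggregate state
     * (expression taken from the GPDB docs; type name is assumed) */
    MPool *mpool = aggstate->hhashtable->group_buf;

    return (concat_state *) mpool_alloc(mpool, sizeof(concat_state));
}
{code}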


