Nikhil Kak created MADLIB-1385:
----------------------------------
Summary: DL: C function for aggregating arrays for input
preprocessor
Key: MADLIB-1385
URL: https://issues.apache.org/jira/browse/MADLIB-1385
Project: Apache MADlib
Issue Type: Improvement
Components: Deep Learning
Reporter: Nikhil Kak
Fix For: v1.17
As a follow-up from https://issues.apache.org/jira/browse/MADLIB-1334, we
noticed that writing our own implementation of array_cat gives a large
performance gain (roughly 54x in the run below). Note that this gain was
measured with incomplete code, so it is possible that there is no performance
gain at all in the worst case. It is still promising enough to warrant a story.
See the attached branch and
https://github.com/reductionista/madlib/blob/b1beaa88c3ec3c2ed412d2c20b5a3cad2f31a8ef/methods/array_ops/src/pg_gp/array_ops.c#L2153
Compared `agg_array_concat` vs `my_agg_array_concat` on a local Mac with GPDB
5.21 and 3 segments:
{code}
madlib=# select array_dims(madlib.agg_array_concat(ARRAY[x])) from mnist_train_10k;
array_dims
------------------
[1:10000][1:784]
Time: 30949.794 ms

madlib=# select array_dims(madlib.my_agg_array_concat(ARRAY[x], 3380)) from mnist_train_10k;
array_dims
------------------
[1:10000][1:784]
Time: 570.079 ms
{code}
This run shows roughly a 54x speedup with our C function, but the caveat is
that our function isn't complete yet.

Right now, inside the UDA's transition function, we allocate all the memory in
advance by passing max(total rows per segment across all the segments). This
will have to be changed so that we can grow the memory incrementally with
something like repalloc, which may or may not slow our C function down; see
the sketch below.

Will create a separate story for working on the C function.
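Purely as an illustration of the incremental-growth idea (not the attached
branch's code), here is a minimal sketch of a transition function that doubles
its buffer with repalloc as rows arrive. The names ConcatState and
my_agg_array_concat_transfn are hypothetical, and it assumes a non-null
float8[] input:
{code}
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

/* Hypothetical state: a flat buffer of float8 elements. */
typedef struct ConcatState
{
    float8 *data;    /* elements collected so far */
    int64   nused;   /* elements written */
    int64   nalloc;  /* elements allocated */
} ConcatState;

PG_FUNCTION_INFO_V1(my_agg_array_concat_transfn);

Datum
my_agg_array_concat_transfn(PG_FUNCTION_ARGS)
{
    MemoryContext aggctx, oldctx;
    ConcatState  *state;
    ArrayType    *arr;
    int           nelems;

    if (!AggCheckCallContext(fcinfo, &aggctx))
        elog(ERROR, "my_agg_array_concat_transfn called in non-aggregate context");

    /* Skip NULL inputs for brevity. */
    if (PG_ARGISNULL(1))
    {
        if (PG_ARGISNULL(0))
            PG_RETURN_NULL();
        PG_RETURN_POINTER(PG_GETARG_POINTER(0));
    }

    arr = PG_GETARG_ARRAYTYPE_P(1);
    nelems = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));

    /* All state memory must live in the aggregate's memory context. */
    oldctx = MemoryContextSwitchTo(aggctx);

    if (PG_ARGISNULL(0))
    {
        /* First row: start small instead of max(rows per segment). */
        state = (ConcatState *) palloc(sizeof(ConcatState));
        state->nused = 0;
        state->nalloc = Max(nelems, 1024);
        state->data = (float8 *) palloc(state->nalloc * sizeof(float8));
    }
    else
        state = (ConcatState *) PG_GETARG_POINTER(0);

    /* Double the buffer until the new row fits; geometric growth keeps
     * the amortized cost of repalloc constant per element. */
    if (state->nused + nelems > state->nalloc)
    {
        while (state->nused + nelems > state->nalloc)
            state->nalloc *= 2;
        state->data = (float8 *) repalloc(state->data,
                                          state->nalloc * sizeof(float8));
    }

    /* Assumes a non-null float8[] input, so the data area is packed. */
    memcpy(state->data + state->nused, ARR_DATA_PTR(arr),
           nelems * sizeof(float8));
    state->nused += nelems;

    MemoryContextSwitchTo(oldctx);
    PG_RETURN_POINTER(state);
}
{code}
With geometric growth the repalloc cost is amortized O(1) per element, which
is why the incremental version may stay close to the preallocating one.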
We can reference the array_agg implementation in PostgreSQL >= 9.5, which
supports taking arrays as input, and also make sure we follow the GPDB
guidance. See:
https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html
https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/array_userfuncs.c#L542
https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/arrayfuncs.c#L5195
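For the array-input case, the heart of PostgreSQL's own array_agg(anyarray)
transition function (the array_userfuncs.c link above) can be sketched as
follows; the function name here is hypothetical, but accumArrayResultArr()
and AggCheckCallContext() are real PostgreSQL 9.5+ APIs:
{code}
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_array_agg_array_transfn);

Datum
my_array_agg_array_transfn(PG_FUNCTION_ARGS)
{
    Oid arg_typeid = get_fn_expr_argtype(fcinfo->flinfo, 1);
    MemoryContext aggcontext;
    ArrayBuildStateArr *state;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "my_array_agg_array_transfn called in non-aggregate context");

    state = PG_ARGISNULL(0) ? NULL
          : (ArrayBuildStateArr *) PG_GETARG_POINTER(0);

    /* Appends the whole input array as one row of the result;
     * all state memory lives in aggcontext. */
    state = accumArrayResultArr(state,
                                PG_GETARG_DATUM(1),
                                PG_ARGISNULL(1),
                                arg_typeid,
                                aggcontext);

    PG_RETURN_POINTER(state);
}
{code}
accumArrayResultArr() handles the buffer growth internally, so it is a useful
baseline to compare the custom repalloc approach against.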
Guidance on how to avoid memory leaks when writing your own UDA in Greenplum,
from https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html:
{code}
Note: If you write a user-defined aggregate in C, and you declare the state
value (stype) as type internal, there is a risk of an out-of-memory error
occurring. If internal state values are not properly managed and a query
acquires too much memory for state values, an out-of-memory error could occur.
To prevent this, use mpool_alloc(mpool, size) to have Greenplum manage and
allocate memory for non-temporary state values, that is, state values that
have a lifespan for the entire aggregation. The argument mpool of the
mpool_alloc() function is aggstate->hhashtable->group_buf. For an example, see
the implementation of the numeric data type aggregates in
src/backend/utils/adt/numeric.c in the Greenplum Database open source code.
{code}
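Note that the repalloc sketch above allocates through the aggregate memory
context obtained from AggCheckCallContext(); per this note, a
Greenplum-specific implementation that declares its state as type internal may
additionally need to route long-lived state through mpool_alloc(), as
numeric.c does.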