[ https://issues.apache.org/jira/browse/MADLIB-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1385:
------------------------------------
    Fix Version/s:     (was: v1.17)
                   v2.0

> DL: C function for aggregating arrays for input preprocessor
> ------------------------------------------------------------
>
>                 Key: MADLIB-1385
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1385
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Deep Learning
>            Reporter: Nikhil Kak
>            Priority: Major
>             Fix For: v2.0
>
>
> As a follow-up from https://issues.apache.org/jira/browse/MADLIB-1334, we 
> noticed that writing our own implementation of array_cat gave a 6x 
> performance gain. Note that this gain was measured with incomplete code, so 
> it is possible that there is no performance gain at all in the worst case. 
> It is still promising enough to warrant a story.
> See attached branch and 
> https://github.com/reductionista/madlib/blob/b1beaa88c3ec3c2ed412d2c20b5a3cad2f31a8ef/methods/array_ops/src/pg_gp/array_ops.c#L2153
> {code}
> Compared agg_array_concat vs my_agg_array_concat on a local Mac with GPDB 
> 5.21 and 3 segments:
> madlib=# select array_dims(madlib.agg_array_concat(ARRAY[x])) from 
> mnist_train_10k;
>  array_dims
> ------------------
>  [1:10000][1:784]
> Time: 30949.794 ms
> madlib=# select array_dims(madlib.my_agg_array_concat(ARRAY[x], 3380)) from 
> mnist_train_10k;
>  array_dims
> ------------------
>  [1:10000][1:784]
> Time: 570.079 ms
> {code}
> We see a 6x performance gain here with our C function, but the caveat is 
> that our function isn't complete yet.
> Right now, inside the UDA's transition function, we allocate all the memory 
> in advance by passing max(total rows per segment across all the segments). 
> This will have to be changed so that we can use something like repalloc to 
> grow the memory incrementally, which may or may not slow down our C 
> function.
> We will create a separate story for working on the C function.
> We can reference the array_agg implementation in PostgreSQL >= 9.5, which 
> supports taking arrays as input, and also make sure we follow the GPDB 
> guidance. See:
> https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html
> https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/array_userfuncs.c#L542
> https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/arrayfuncs.c#L5195
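For reference, a sketch of what a transition function built on the PostgreSQL >= 9.5 array-builder API (the one array_agg(anyarray) uses, per the links above) might look like. This is extension code, not standalone-runnable, and the function name my_agg_array_concat_transfn is hypothetical:

```c
/* Sketch only: assumes the PostgreSQL server headers and the
 * ArrayBuildStateArr API added in 9.5 (see array_userfuncs.c above). */
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

Datum
my_agg_array_concat_transfn(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    ArrayBuildStateArr *state;

    /* The state must live in the aggregate's memory context so it
     * survives across rows. */
    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "transition function called in non-aggregate context");

    state = PG_ARGISNULL(0) ? NULL
          : (ArrayBuildStateArr *) PG_GETARG_POINTER(0);

    /* accumArrayResultArr appends the input array as one row of the
     * result, growing the state's buffer internally (cf. repalloc);
     * it initializes the state when passed NULL. */
    state = accumArrayResultArr(state,
                                PG_GETARG_DATUM(1),
                                PG_ARGISNULL(1),
                                get_fn_expr_argtype(fcinfo->flinfo, 1),
                                aggcontext);
    PG_RETURN_POINTER(state);
}
```

Reusing this builder API would get the incremental growth for free, though we would still need to check it follows the GPDB guidance on internal-state memory quoted below.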
> Guidance on how to avoid memory leaks when writing your own UDA in 
> Greenplum, from
> https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html
> {code}
> Note: If you write a user-defined aggregate in C, and you declare the state 
> value (stype) as type internal, there is a risk of an
> out-of-memory error occurring. If internal state values are not properly 
> managed and a query acquires too much memory 
> for state values, an out-of-memory error could occur. To prevent this, use 
> mpool_alloc(mpool, size) to have Greenplum manage 
> and allocate memory for non-temporary state values, that is, state values 
> that have a lifespan for the entire aggregation. 
> The argument mpool of the mpool_alloc() function is 
> aggstate->hhashtable->group_buf. 
> For an example, see the implementation of the numeric data type aggregates in 
> src/backend/utils/adt/numeric.c 
> in the Greenplum Database open source code.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
