[https://issues.apache.org/jira/browse/MADLIB-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947207#comment-16947207]
Nikhil Kak commented on MADLIB-1334:
------------------------------------
We made some progress with this Jira:
# *Using real[] instead of anyarray in agg_array_concat solves the scaling
problem, i.e. running the input preprocessor on the same dataset with different
buffer sizes results in comparable runtimes.* This is because the plan for the
aggregate over real[] aggregates on the segments and then gathers the partial
results on the master, whereas the plan for the aggregate over anyarray first
gathers all the data on the master and then runs the aggregate function there.
(A sketch of the aggregate definition follows after this list.)
# Another improvement, which only applies to gpdb (will create a new Jira for
this):
## When we create a table (for normalizing and one-hot encoding) inside the
preprocessor UDF, the stats for that table are not populated. As a result, the
plan seems to pick GroupAgg, which is much slower than HashAgg. We found that
setting {{gp_autostats_mode_in_functions}} to {{'ON_NO_STATS'}} makes the CTAS
query generate stats as well; the plan then uses HashAgg and runs much faster
(see the gp_autostats sketch after this list). The downside is that we are
using array_cat as the transition function for the aggregate, and the existing
version of array_cat is inefficient and slow because of how it manages memory.
See
[https://github.com/greenplum-db/gpdb/issues/8764#issuecomment-539205619] for
more details.
So the solution is to write our own C function for array_cat. We can reference
the array_agg implementation in pg 9.5+, which supports taking arrays as input,
and also make sure we follow the gpdb guidance (a wiring sketch follows after
this list). See:
### [https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html]
### [https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/array_userfuncs.c#L542]
### [https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/arrayfuncs.c#L5195]
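For the first item, a minimal sketch of the aggregate change (names follow the
existing agg_array_concat/array_cat pair; the exact DDL in the MADlib build may
differ):
{code}
-- Current shape: polymorphic state type. Per the plans described above, all
-- rows are gathered on the master and the aggregate runs there.
CREATE AGGREGATE madlib.agg_array_concat(anyarray) (
    SFUNC = array_cat,
    STYPE = anyarray
);

-- Proposed shape: concrete real[] state type. The aggregate then runs on the
-- segments and only the partial states are gathered on the master.
CREATE AGGREGATE madlib.agg_array_concat(real[]) (
    SFUNC = array_cat,
    STYPE = real[]
);
{code}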
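For the nested gpdb item, a quick way to try the gp_autostats setting in a psql
session (the SET is session-level here; the preprocessor call just repeats the
one from the issue description below):
{code}
-- With this setting, a CTAS executed inside a UDF also collects stats on the
-- new table when it has none, so the planner can pick HashAgg over GroupAgg.
SET gp_autostats_mode_in_functions TO 'ON_NO_STATS';

-- Re-run the preprocessor; the aggregation over the normalized/one-hot
-- encoded table should now use HashAgg.
SELECT madlib.minibatch_preprocessor_dl('mnist_train', 'mnist_train_packed',
                                        'y', 'x', NULL, 255, NULL);
{code}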
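And a sketch of how a custom C transition function could be wired in per the
CREATE AGGREGATE guidance linked above (the function name, the
'MODULE_PATHNAME' library reference, and the C entry point are hypothetical,
not something that exists in MADlib today):
{code}
-- Hypothetical C transition function that appends one real[] to the running
-- state while reusing the state's memory, modeled on the array_agg(anyarray)
-- support code linked above.
CREATE FUNCTION madlib.array_concat_transition(real[], real[])
RETURNS real[]
AS 'MODULE_PATHNAME', 'array_concat_transition'
LANGUAGE C IMMUTABLE;

-- The aggregate then uses it in place of the built-in array_cat.
CREATE AGGREGATE madlib.agg_array_concat(real[]) (
    SFUNC = madlib.array_concat_transition,
    STYPE = real[]
);
{code}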
> Mini-batch preprocessor for DL running very slowly
> --------------------------------------------------
>
> Key: MADLIB-1334
> URL: https://issues.apache.org/jira/browse/MADLIB-1334
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Assignee: Nikhil Kak
> Priority: Major
> Fix For: v1.16
>
>
> Observed on a 2-segment Greenplum 5.x cluster using the latest build from MASTER:
> current `minibatch_preprocessor`
> 1) 60K MNIST training examples = 28.1 sec
> 2) 10K MNIST test examples = 5.9 sec
> new `minibatch_preprocessor_dl`
> 3) 60K MNIST training examples = 1912.3 sec
> 4) 10K MNIST test examples = 24.2 sec
> Wonder if there is a bug here, or at least a performance issue? I thought
> `minibatch_preprocessor_dl` was supposed to be faster than
> `minibatch_preprocessor`.
> (1)
> {code}
> madlib=#
> madlib=# SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
> madlib(#                                      'mnist_train_packed',  -- Output table
> madlib(#                                      'y',                   -- Dependent variable
> madlib(#                                      'x',                   -- Independent variables
> madlib(#                                      NULL,                  -- Grouping
> madlib(#                                      NULL,                  -- Buffer size
> madlib(#                                      TRUE                   -- One-hot encode integer dependent var
> madlib(#                                      );
> minibatch_preprocessor
> ------------------------
>
> (1 row)
> Time: 28093.977 ms
> {code}
> (2)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor('mnist_test',         -- Source table
> madlib(#                                      'mnist_test_packed',  -- Output table
> madlib(#                                      'y',                  -- Dependent variable
> madlib(#                                      'x',                  -- Independent variables
> madlib(#                                      NULL,                 -- Grouping
> madlib(#                                      NULL,                 -- Buffer size
> madlib(#                                      TRUE                  -- One-hot encode integer dependent var
> madlib(#                                      );
> minibatch_preprocessor
> ------------------------
>
> (1 row)
> Time: 5934.194 ms
> {code}
> (3)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor_dl('mnist_train',         -- Source table
> madlib(#                                         'mnist_train_packed',  -- Output table
> madlib(#                                         'y',                   -- Dependent variable
> madlib(#                                         'x',                   -- Independent variable
> madlib(#                                         NULL,                  -- Buffer size
> madlib(#                                         255,                   -- Normalizing constant
> madlib(#                                         NULL
> madlib(#                                         );
> minibatch_preprocessor_dl
> ---------------------------
>
> (1 row)
> Time: 1912268.396 ms
> {code}
> (4)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor_dl('mnist_test',         -- Source table
> madlib(#                                         'mnist_test_packed',  -- Output table
> madlib(#                                         'y',                  -- Dependent variable
> madlib(#                                         'x',                  -- Independent variable
> madlib(#                                         NULL,                 -- Buffer size
> madlib(#                                         255,                  -- Normalizing constant
> madlib(#                                         NULL
> madlib(#                                         );
> minibatch_preprocessor_dl
> ---------------------------
>
> (1 row)
> Time: 24192.195 ms
> {code}