[ 
https://issues.apache.org/jira/browse/MADLIB-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947207#comment-16947207
 ] 

Nikhil Kak edited comment on MADLIB-1334 at 10/8/19 10:05 PM:
--------------------------------------------------------------

We made some progress on this Jira:

 
 # *Using real[] instead of anyarray in agg_array_concat solves the scaling problem, i.e. running the input preprocessor on the same dataset with different buffer sizes results in comparable runtimes.* This is because the plan for the aggregate with real[] aggregates on the segments and then gathers the results on the master, whereas the plan for the aggregate with anyarray first gathers all the data on the master and then runs the aggregate function there.
 # Another improvement, which only applies to gpdb (will create a new Jira for this):
 ## When we create a table (for normalizing and one-hot encoding) inside the preprocessor UDF, it does not populate the stats for that table. As a result, the planner picks GroupAgg, which is much slower than HashAgg. We found that setting {{gp_autostats_mode_in_functions}} to {{'ON_NO_STATS'}} makes the CTAS query generate stats as well, after which the plan uses HashAgg and runs much faster. The remaining downside is that we are using array_cat as the transition function for the aggregate, and the existing version of array_cat is inefficient because of how it manages memory. See [https://github.com/greenplum-db/gpdb/issues/8764#issuecomment-539205619] for more details.
 So the solution is to write our own C function for array_cat. We can reference the array_agg implementation in PostgreSQL 9.5+, which supports taking arrays as input, and also make sure we follow the gpdb guidance; see:
 ## [https://gpdb.docs.pivotal.io/6-0/ref_guide/sql_commands/CREATE_AGGREGATE.html]
 ## [https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/array_userfuncs.c#L542]
 ## [https://github.com/postgres/postgres/blob/7eefec61c9e3d053e063696b71e8f11b268b98f4/src/backend/utils/adt/arrayfuncs.c#L5195]
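A minimal sketch of point 1, assuming the aggregate is declared roughly like agg_array_concat is today (the exact names and options in MADlib may differ):

{code}
-- Sketch only: declaring the concat aggregate over real[] instead of
-- anyarray, so GPDB runs the transition on the segments and only
-- gathers partial states on the master.
CREATE AGGREGATE madlib.agg_array_concat(real[]) (
    SFUNC = array_cat,   -- transition function: append each input array
    STYPE = real[],      -- transition state is a flat real[] buffer
    PREFUNC = array_cat  -- combine per-segment partial states (GPDB)
);
{code}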

 
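For point 2, the workaround can be reproduced with a session-level setting (a sketch; 'ON_NO_STATS' collects stats only when the target table has none yet):

{code}
-- Let CTAS inside a UDF collect stats on the newly created table, so
-- the planner has the information it needs to pick HashAgg over GroupAgg.
SET gp_autostats_mode_in_functions = 'ON_NO_STATS';
{code}

Running EXPLAIN on the aggregate query before and after the change should show the plan switching from GroupAggregate to HashAggregate.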



> Mini-batch preprocessor for DL running very slowly
> --------------------------------------------------
>
>                 Key: MADLIB-1334
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1334
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Nikhil Kak
>            Priority: Major
>             Fix For: v1.16
>
>
> Observed on a 2-segment Greenplum 5.x cluster using the latest build from MASTER:
> current `minibatch_preprocessor`
> 1) 60K MNIST training examples = 28.1 sec
> 2) 10K MNIST test examples = 5.9 sec
> new `minibatch_preprocessor_dl`
> 3) 60K MNIST training examples = 1912.3 sec
> 4) 10K MNIST test examples = 24.2 sec
> Wonder if there is a bug here, or at least a performance issue?  I thought 
> `minibatch_preprocessor_dl` was supposed to be faster than 
> `minibatch_preprocessor` 
> (1)
> {code}
> madlib=# 
> madlib=# SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
> madlib(#                                      'mnist_train_packed',  -- Output table
> madlib(#                                      'y',                   -- Dependent variable
> madlib(#                                      'x',                   -- Independent variables
> madlib(#                                      NULL,                  -- Grouping
> madlib(#                                      NULL,                  -- Buffer size
> madlib(#                                      TRUE                   -- One-hot encode integer dependent var
> madlib(#                                      );
>  minibatch_preprocessor 
> ------------------------
>  
> (1 row)
> Time: 28093.977 ms
> {code}
> (2)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor('mnist_test',          -- Source table
> madlib(#                                      'mnist_test_packed',   -- Output table
> madlib(#                                      'y',                   -- Dependent variable
> madlib(#                                      'x',                   -- Independent variables
> madlib(#                                      NULL,                  -- Grouping
> madlib(#                                      NULL,                  -- Buffer size
> madlib(#                                      TRUE                   -- One-hot encode integer dependent var
> madlib(#                                      );
>  minibatch_preprocessor 
> ------------------------
>  
> (1 row)
> Time: 5934.194 ms
> {code}
> (3)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor_dl('mnist_train',         -- Source table
> madlib(#                                         'mnist_train_packed',  -- Output table
> madlib(#                                         'y',                   -- Dependent variable
> madlib(#                                         'x',                   -- Independent variable
> madlib(#                                         NULL,                  -- Buffer size
> madlib(#                                         255,                   -- Normalizing constant
> madlib(#                                         NULL
> madlib(#                                         );
>  minibatch_preprocessor_dl 
> ---------------------------
>  
> (1 row)
> Time: 1912268.396 ms
> {code}
> (4)
> {code}
> madlib=# SELECT madlib.minibatch_preprocessor_dl('mnist_test',         -- Source table
> madlib(#                                         'mnist_test_packed',  -- Output table
> madlib(#                                         'y',                   -- Dependent variable
> madlib(#                                         'x',                   -- Independent variable
> madlib(#                                         NULL,                  -- Buffer size
> madlib(#                                         255,                   -- Normalizing constant
> madlib(#                                         NULL
> madlib(#                                         );
>  minibatch_preprocessor_dl 
> ---------------------------
>  
> (1 row)
> Time: 24192.195 ms
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
