[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil updated MADLIB-1200:
---------------------------
Description:

Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}} data scientist
{{I want to}} pre-process input files for use with mini-batching
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is a one-time operation).

Interface

This function is roughly the inverse of array_unnest_2d_to_1d()
[http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]
but the difference is that we want to persist an output table for the packed 2d array.

Suggested interface:

matrix_nest_1d_to_2d (
    source_table,
    output_table,
    independent_varname,
    dependent_varname,
    n_elements,  -- number of elements to pack
    encode       -- default True
);

where dependent_varname is a column of 1d arrays. Or should it be called array_nest_1d_to_2d()?

Notes
1) Random shuffle is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.

Acceptance
1) Convert from the standard format to the special format for mini-batching.
2) Some scale testing is OK (does not need to be comprehensive).
3) Document as a helper function in the user docs.
4) IC

was: (identical, except that the "encode -- default True" parameter was absent)

> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
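The shuffle-then-pack preprocessing the description asks for can be sketched in Python. This is illustrative only: the real helper would run in SQL over a database table, and the function name, the one-hot behavior of `encode`, and the handling of the last partial batch are assumptions, not the MADlib implementation.

```python
import numpy as np

def matrix_nest_1d_to_2d(X, y, n_elements, encode=True, seed=0):
    """Sketch: randomly shuffle rows (required for mini-batching), then
    pack every n_elements consecutive 1d rows into one 2d-array batch."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))            # random shuffle of rows
    X = np.asarray(X)[order]
    y = np.asarray(y)[order]

    if encode:
        # One-hot encode the dependent variable (assumed meaning of `encode`)
        classes = np.unique(y)
        y = (y[:, None] == classes[None, :]).astype(int)

    # Naive packing: the last batch may be smaller than n_elements
    batches_X = [X[i:i + n_elements] for i in range(0, len(X), n_elements)]
    batches_y = [y[i:i + n_elements] for i in range(0, len(y), n_elements)]
    return batches_X, batches_y
```

A naive pass like this matches note 2 above: each output row holds an `n_elements x d` matrix, so a downstream optimizer reads one batch per row instead of unpacking rows on every epoch.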