[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil updated MADLIB-1200:
---------------------------
Description:

Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}} data scientist
{{I want to}} pre-process input files for use with mini-batching
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time I train a model).

Interface

{code:java}
minibatch_preprocessor (
    source_table,        -- Name of the table containing the input data.
    output_table,        -- Name of the output table suitable for mini-batching.
    dependent_varname,   -- Name of the dependent variable column.
    independent_varname, -- Expression list to evaluate for the independent variables.
    buffer_size          -- ???
){code}

The main purpose of the function is to prepare the training data for mini-batching algorithms. This will be achieved in two stages:
# Based on the buffer size, group the dependent and independent variables into a single tuple representative of the batch.
# If the independent variables are boolean or text, perform one-hot encoding. This does not apply to integers and floats. Note that if integer variables are actually categorical, they must be cast to ::TEXT so that they get encoded.

Notes
1) A random shuffle is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.

Acceptance
1) Convert from the standard format to the special format for mini-batching.
2) Standardize by default for now; the user cannot opt out of it. We may decide to add a flag later.
3) Some scale testing is OK (it does not need to be comprehensive).
4) Document as a helper function in the user docs.
5) IC
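The two stages above can be sketched roughly as follows. This is an illustrative Python sketch only, not the MADlib implementation (which operates on database tables in SQL); the function and column layout are hypothetical, and the random shuffle mentioned in the notes is omitted for brevity.

```python
from itertools import islice

def one_hot(value, categories):
    # Stage 2: one-hot encode a categorical (boolean/text) value
    # against a fixed, ordered set of categories.
    return [1.0 if value == c else 0.0 for c in sorted(categories)]

def minibatch_preprocess(rows, buffer_size):
    # Stage 1: pack rows into buffer-sized batches, yielding one
    # (dependent values, independent values) tuple per batch.
    # Each input row is assumed to be (dependent, indep_1, indep_2, ...).
    it = iter(rows)
    while True:
        batch = list(islice(it, buffer_size))
        if not batch:
            break
        deps = [r[0] for r in batch]
        indeps = [r[1:] for r in batch]
        yield (deps, indeps)

# Example: 5 rows with buffer_size=2 yield batches of 2, 2, and 1 rows.
rows = [(0, 1.0, 2.0), (1, 3.0, 4.0), (0, 5.0, 6.0),
        (1, 7.0, 8.0), (0, 9.0, 10.0)]
batches = list(minibatch_preprocess(rows, 2))
```

This also illustrates why categorical integers need the ::TEXT cast: a plain integer column would be passed through numerically by stage 1, so only values already typed as text or boolean reach the encoding step.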
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)