[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1200:
------------------------------------
    Description: 
Related to
 https://issues.apache.org/jira/browse/MADLIB-1037
 https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}} data scientist,
{{I want to}} pre-process input files for use with mini-batching,
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time I train a model).

Interface
{code:java}
minibatch_preprocessor(
    source_table,         -- Name of the table containing the input data.
    output_table,         -- Name of the table suitable for mini-batching.
    dependent_varname,    -- Name of the dependent variable column.
    independent_varname,  -- Expression list to evaluate for the independent variables.
    buffer_size           -- ???
){code}
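For concreteness, a call might look like the following. The table and column names here are purely illustrative, and the meaning of {{buffer_size}} is still an open question (hence the ??? above):
{code:java}
-- Hypothetical usage; table/column names are illustrative only.
SELECT madlib.minibatch_preprocessor(
    'iris_data',                         -- source_table
    'iris_data_packed',                  -- output_table
    'class_text',                        -- dependent_varname
    'ARRAY[sepal_length, sepal_width]',  -- independent_varname
    100                                  -- buffer_size (semantics TBD)
);
{code}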
 

The main purpose of the function is to prepare the training data for mini-batching algorithms. This will be achieved in two stages:
 # Based on the batch size, group all the dependent and independent variables into a single tuple representative of the batch.
 # If the independent variables are boolean or text, perform one-hot encoding; this is not needed for integer and float variables. Note that if integer variables are actually categorical, they must be cast to ::TEXT so that they get encoded. (A sketch of both stages follows this list.)
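As a rough illustration of both stages, the sketch below packs rows into buffers with {{array_agg}} and one-hot encodes a text column using boolean comparisons. It is plain PostgreSQL; the table name {{raw_data}}, its columns, and the category values are hypothetical, and the actual implementation may differ:
{code:java}
-- Stage 1 (hypothetical): pack every 3 rows into one buffer row.
-- Assumes a source table raw_data(x1, x2, y) and a buffer size of 3.
SELECT
    row_id / 3                               AS buffer_id,
    array_agg(y ORDER BY row_id)             AS dependent_varname,
    array_agg(ARRAY[x1, x2] ORDER BY row_id) AS independent_varname
FROM (
    SELECT (row_number() OVER ()) - 1 AS row_id, x1, x2, y
    FROM raw_data
) numbered
GROUP BY 1;

-- Stage 2 (hypothetical): one-hot encode a text column with known
-- categories 'setosa', 'versicolor', 'virginica'.
SELECT ARRAY[
    (y = 'setosa')::INTEGER,
    (y = 'versicolor')::INTEGER,
    (y = 'virginica')::INTEGER
] AS y_one_hot
FROM raw_data;
{code}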

Notes

1) A random shuffle of the input rows is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.
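In line with note 2, the naive shuffle could be as simple as the following (a minimal sketch assuming a plain PostgreSQL table; names are hypothetical):
{code:java}
-- Naive random shuffle of the source rows before packing into buffers.
CREATE TABLE source_shuffled AS
SELECT *
FROM source_table
ORDER BY random();
{code}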

Acceptance

1) Convert from the standard format to the special format for mini-batching.
2) Some scale testing OK (does not need to be comprehensive).
3) Document as a helper function in the user docs.
4) IC

  was:
Related to
 https://issues.apache.org/jira/browse/MADLIB-1037
 https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}} data scientist,
{{I want to}} pre-process input files for use with mini-batching,
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time I train a model).

Interface

This function is kind of the inverse of:

 

Suggested interface:
{code:java}
minibatch_preprocessor(
    source_table,
    output_table,
    dependent_varname,
    independent_varname,
    batch_size,    -- Number of elements to pack
    encode         -- One-hot encoding if set to TRUE
){code}
 

The main purpose of the function is to prepare the training data for mini-batching algorithms. This will be achieved in two stages:

1. Based on the batch size, group all the dependent and independent variables into a single tuple representative of the batch.
2. If the encode parameter is TRUE, perform one-hot encoding for the dependent variable. Users will need to set encode to TRUE for multi-class SVM/MLP and FALSE for single-class SVM.

Notes

1) A random shuffle of the input rows is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.

Acceptance

1) Convert from the standard format to the special format for mini-batching.
2) Some scale testing OK (does not need to be comprehensive).
3) Document as a helper function in the user docs.
4) IC


> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>


