[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil updated MADLIB-1200:
---------------------------
Description:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048
Story
{{As a}} data scientist
{{I want to}} pre-process input files for use with mini-batching
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to redo every time I train a model).
Interface
{code:sql}
minibatch_preprocessor(
    source_table,         -- Name of the table containing the input data.
    output_table,         -- Name of the output table suitable for mini-batching.
    dependent_varname,    -- Name of the dependent variable column.
    independent_varname,  -- Expression list to evaluate for the independent variables.
    buffer_size           -- Buffer size. Default is to pack as much as possible
                          -- within the 1 GB limit imposed by PostgreSQL/Greenplum.
)
{code}
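For illustration, a call might look like the following. The table and column names ({{iris_data}}, {{class_text}}, the sepal/petal attributes) are hypothetical, and the function is assumed to be installed in the {{madlib}} schema.
{code:sql}
-- Hypothetical invocation; table and column names are illustrative only.
SELECT madlib.minibatch_preprocessor(
    'iris_data',            -- source_table
    'iris_data_packed',     -- output_table
    'class_text',           -- dependent_varname
    'ARRAY[sepal_length, sepal_width, petal_length, petal_width]',
                            -- independent_varname (expression list)
    NULL                    -- buffer_size: NULL is assumed to select the default
);
{code}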
The main purpose of the function is to prepare the training data for mini-batching algorithms. This is achieved in two stages:
# Based on the batch size, group all of the dependent and independent variables into a single tuple representative of the batch.
# If the independent variables are boolean or text, perform one-hot encoding (see the sketch below). This does not apply to integer and float columns; note that if integer variables are actually categorical, they must be cast to {{::TEXT}} so that they get encoded.
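As a rough illustration of stage 2, one-hot encoding a text column could be expressed as below. The column name {{color}} and its values are hypothetical; an integer code column would first need an explicit cast such as {{color_code::TEXT}}.
{code:sql}
-- Illustrative one-hot encoding of a hypothetical categorical column "color"
-- whose distinct values are 'red', 'green', 'blue'.
SELECT ARRAY[
           (color = 'red')::INTEGER,
           (color = 'green')::INTEGER,
           (color = 'blue')::INTEGER
       ] AS color_encoded
FROM source_table;
{code}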
Notes
1) A random shuffle of the rows is needed for mini-batching (see the sketch below).
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.
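A rough sketch of stage 1 combined with the shuffle, in plain SQL rather than the eventual implementation. {{source_table}}, its column names, and the buffer size of 4 are placeholders.
{code:sql}
-- Rough sketch only: shuffle the rows, assign each row to a buffer of 4 rows,
-- then aggregate each buffer into a single tuple.
WITH shuffled AS (
    SELECT dependent_var,
           independent_vars,   -- assumed to be a DOUBLE PRECISION[] per row
           (row_number() OVER (ORDER BY random()) - 1) / 4 AS buffer_id
    FROM source_table
    WHERE dependent_var IS NOT NULL   -- NULLs in the dependent variable are ignored
)
SELECT buffer_id,
       array_agg(dependent_var)    AS dependent_varname,
       array_agg(independent_vars) AS independent_varname  -- 2-D array on PostgreSQL 9.5+
FROM shuffled
GROUP BY buffer_id;
{code}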
Acceptance
Summary
1) Convert from the standard format to the special format for mini-batching.
2) Standardize by default for now; the user cannot opt out of it (see the sketch below). We may decide to add a flag later.
3) Some scale testing is OK (does not need to be comprehensive).
4) Document as a helper function in the user docs.
5) Always ignore NULLs in the dependent variable.
6) IC
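For item 2, the default standardization is assumed to be z-scoring of each independent variable. A minimal per-column sketch, with the column {{x}} as a placeholder:
{code:sql}
-- Hedged sketch: z-score one numeric column x before packing; the real helper
-- would apply this element-wise to the independent variable array.
SELECT (x - avg(x) OVER ()) / NULLIF(stddev_pop(x) OVER (), 0) AS x_standardized
FROM source_table;
{code}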
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)