[ 
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1200:
------------------------------------
    Description: 
Related to
 https://issues.apache.org/jira/browse/MADLIB-1037
 https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
 data scientist
 {{I want to}}
 pre-process input files for use with mini-batching
 {{so that}}
 the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
perhaps because I am tuning parameters (i.e., pre-processing is an occasional 
operation that I don't want to re-do every time that I train a model)

Interface

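This function is kind of the inverse of:

array_unnest_2d_to_1d()
[http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]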

Suggested interface:

matrix_nest_1d_to_2d (
 source_table,
 output_table,
 independent_varname,
 dependent_varname,
 n_elements, -- number of elements to pack
 encode -- one-hot encoding if set to true
 );

where dependent_varname is a column of 1d arrays

Or call it

array_nest_1d_to_2d ()

?

The main purpose of the function is to prepare the training data for 
mini-batching algorithms. This will be achieved in 2 stages:

1. Based on the batch size, group the dependent and independent variables of 
each batch into a single tuple that represents the batch.
2. If the encode parameter is true, perform one-hot encoding of the dependent 
variable. Users will need to set encode to true for multi-class SVM/MLP and 
false for single-class SVM.
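
The two stages above can be sketched in plain Python. This is only an illustration of the intended behavior, not the MADlib implementation: pack_minibatches and one_hot are hypothetical names, the input is assumed to be a list of (features, label) pairs rather than a database table, and row order is kept as given.

```python
def one_hot(label, classes):
    """Return a 0/1 list with a single 1 at the position of `label`."""
    return [1 if label == c else 0 for c in classes]


def pack_minibatches(rows, n_elements, encode=False):
    """Pack (features, label) rows into batches of up to n_elements rows.

    Hypothetical helper for illustration only. Returns a list of
    (x_batch, y_batch) tuples, where x_batch is a 2d list of features.
    """
    if n_elements < 1:
        raise ValueError("n_elements must be at least 1")
    rows = list(rows)
    # Assumption: the class list is the sorted set of labels seen in the input.
    classes = sorted({label for _, label in rows})
    batches = []
    for start in range(0, len(rows), n_elements):
        chunk = rows[start:start + n_elements]
        # Stage 1: one tuple per batch, holding a 2d array of features.
        x_batch = [list(x) for x, _ in chunk]
        # Stage 2: optionally one-hot encode the dependent variable.
        if encode:
            y_batch = [one_hot(y, classes) for _, y in chunk]
        else:
            y_batch = [y for _, y in chunk]
        batches.append((x_batch, y_batch))
    return batches
```

The last batch may hold fewer than n_elements rows when the row count is not a multiple of the batch size; whether the real function should pad or keep a short batch is left open here.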

Notes

1) A random shuffle is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to 
make it run 10% or 20% faster.
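
As a rough picture of the naive approach mentioned in the notes, the shuffle can simply happen once before the rows are cut into batches. The sketch below assumes the rows fit in memory as a Python list; shuffled_batches and its seed parameter are hypothetical and are not part of the suggested interface.

```python
import random


def shuffled_batches(rows, n_elements, seed=None):
    """Shuffle a copy of rows, then cut it into consecutive batches.

    Hypothetical helper for illustration only; the caller's list is
    left untouched.
    """
    if n_elements < 1:
        raise ValueError("n_elements must be at least 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + n_elements]
            for i in range(0, len(shuffled), n_elements)]
```

Because the shuffle happens during pre-processing, its cost is paid once and not on every training run.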

Acceptance

1) Convert from standard to special format for mini-batching
2) Some scale testing OK (does not need to be comprehensive)
3) Document as a helper function in the user docs
4) IC

  was:
Related to
 https://issues.apache.org/jira/browse/MADLIB-1037
 https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
 data scientist
 {{I want to}}
 pre-process input files for use with mini-batching
 {{so that}}
 the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
perhaps because I am tuning parameters (i.e., pre-processing is a 1-time 
operation)

Interface

This function is kind of the inverse of:

array_unnest_2d_to_1d() 
 
[http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]

but the difference is we want to persist an output table for the packed 2d 
array.

Suggested interface:

matrix_nest_1d_to_2d (
 source_table,
 output_table,
 independent_varname,
 dependent_varname,
 n_elements, -- number of elements to pack
 encode -- one-hot encoding if set to true
 );

where dependent_varname is a column of 1d arrays

Or call it

array_nest_1d_to_2d ()

?

The main purpose of the function is to prepare the training data for 
mini-batching algorithms. This will be achieved in 2 stages:

1. Based on the batch size, group the dependent and independent variables of 
each batch into a single tuple that represents the batch.
2. If the encode parameter is true, perform one-hot encoding of the dependent 
variable. Users will need to set encode to true for multi-class SVM/MLP and 
false for single-class SVM.


Notes

1) A random shuffle is needed for mini-batching.
2) A naive approach may be OK to start; it is not worth a big investment to 
make it run 10% or 20% faster.

Acceptance

1) Convert from standard to special format for mini-batching
2) Some scale testing OK (does not need to be comprehensive)
3) Document as a helper function in the user docs
4) IC


> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
