[ 
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil updated MADLIB-1200:
---------------------------
    Description: 
Related to
 https://issues.apache.org/jira/browse/MADLIB-1037
 https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
 data scientist
 {{I want to}}
 pre-process input files for use with mini-batching
 {{so that}}
 the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
for example when tuning parameters (i.e., pre-processing is a one-time 
operation)

Interface

This function is kind of the inverse of:

array_unnest_2d_to_1d() 
 
[http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]

but the difference is we want to persist an output table for the packed 2d 
array.

Suggested interface:

matrix_nest_1d_to_2d (
 source_table, 
 output_table,
 independent_varname,
 dependent_varname,
 n_elements, -- Number of elements to pack
 encode -- Default: True
 );

where dependent_varname is a column of 1d arrays

Or call it

array_nest_1d_to_2d ()

?

Notes

1) Random shuffle needed for mini-batch.
 2) Naive approach may be OK to start, not worth big investment to make run 10% 
or 20% faster.
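
For illustration only, the shuffle-then-pack idea from the notes above can be
sketched in plain Python. This is not the MADlib implementation. The function
name nest_1d_to_2d and the seed parameter are hypothetical, rows are assumed
to be (independent 1d array, dependent value) pairs, and the proposed encode
option is not modeled.

```python
import random


def nest_1d_to_2d(rows, n_elements, seed=None):
    """Shuffle (x, y) rows and pack them into batches of up to n_elements rows.

    Hypothetical helper: each batch is a pair (xs, ys), where xs is a 2d list
    built from the 1d independent arrays and ys lists the dependent values.
    """
    if n_elements < 1:
        raise ValueError("n_elements must be >= 1")

    # Note 1: random shuffle before packing, so batches are not ordered.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # Note 2: naive packing, one slice of n_elements rows per output batch.
    batches = []
    for start in range(0, len(shuffled), n_elements):
        chunk = shuffled[start:start + n_elements]
        xs = [list(x) for x, _ in chunk]
        ys = [y for _, y in chunk]
        batches.append((xs, ys))
    return batches
```

With 10 input rows and n_elements = 4, this yields three batches holding 4, 4,
and 2 rows; the last batch is simply left short in this sketch.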

Acceptance

1) Convert from standard to special format for mini-batching
 2) Some scale testing OK (does not need to be comprehensive)
 3) Document as a helper function in the user docs
 4) IC

  was:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
data scientist
{{I want to}}
pre-process input files for use with mini-batching
{{so that}}
the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
perhaps because I am tuning parameters (i.e., pre-processing is a 1-time 
operation)

Interface

This function is kind of the inverse of:

array_unnest_2d_to_1d() 
http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70

but the difference is we want to persist an output table for the packed 2d 
array.

Suggested interface:

matrix_nest_1d_to_2d (
 source_table, 
 output_table,
 independent_varname,
 dependent_varname,
 n_elements — Number of elements to pack
 );

where dependent_varname is a column of 1d arrays

Or call it

array_nest_1d_to_2d ()

?

Notes


1) Random shuffle needed for mini-batch.
2) Naive approach may be OK to start, not worth big investment to make run 10% 
or 20% faster.

Acceptance

1) Convert from standard to special format for mini-batching
2) Some scale testing OK (does not need to be comprehensive)
3) Document as a helper function user docs
4) IC


> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1037
>  https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
>  data scientist
>  {{I want to}}
>  pre-process input files for use with mini-batching
>  {{so that}}
>  the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
> perhaps because I am tuning parameters (i.e., pre-processing is a 1-time 
> operation)
> Interface
> This function is kind of the inverse of:
> array_unnest_2d_to_1d() 
>  
> [http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]
> but the difference is we want to persist an output table for the packed 2d 
> array.
> Suggested interface:
> matrix_nest_1d_to_2d (
>  source_table, 
>  output_table,
>  independent_varname,
>  dependent_varname,
>  n_elements, -- Number of elements to pack
>  encode -- Default: True
>  );
> where dependent_varname is a column of 1d arrays
> Or call it
> array_nest_1d_to_2d ()
> ?
> Notes
> 1) Random shuffle needed for mini-batch.
>  2) Naive approach may be OK to start, not worth big investment to make run 
> 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
>  2) Some scale testing OK (does not need to be comprehensive)
>  3) Document as a helper function in the user docs
>  4) IC



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
