[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339681#comment-16339681 ]
Nikhil commented on MADLIB-1200:
--------------------------------
I would suggest naming the function in a way that reflects its purpose,
so something like:
{code}
minibatch_preprocessing
{code}
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1037
> https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
> data scientist
> {{I want to}}
> pre-process input files for use with mini-batching
> {{so that}}
> the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
> perhaps because I am tuning parameters (i.e., pre-processing is a 1-time
> operation)
> Interface
> This function is kind of the inverse of:
> array_unnest_2d_to_1d()
>
> [http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]
> but the difference is we want to persist an output table for the packed 2d
> array.
> Suggested interface:
> matrix_nest_1d_to_2d (
> source_table,
> output_table,
> independent_varname,
> dependent_varname,
> n_elements, -- number of elements to pack
> encode -- one-hot encoding, default True
> );
> where dependent_varname is a column of 1d arrays
> Or call it
> array_nest_1d_to_2d ()
> ?
> Notes
> 1) Random shuffle needed for mini-batch.
> 2) A naive approach may be OK to start; it is not worth a big investment to
> make it run 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
> 2) Some scale testing OK (does not need to be comprehensive)
> 3) Document as a helper function in the user docs
> 4) IC (install check)
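
As a rough illustration of the interface and notes quoted above, here is a minimal in-memory sketch of the naive approach: shuffle the rows once, then pack every n_elements rows into one 2d array, optionally one-hot encoding the dependent variable. This is not MADlib's implementation. The function name pack_minibatches and the seed argument are hypothetical, a Python list of (independent, dependent) pairs stands in for source_table, and the returned list stands in for the persisted output_table.

```python
import random


def pack_minibatches(rows, n_elements, encode=True, seed=0):
    """Shuffle rows and pack them into 2d batches of up to n_elements rows.

    rows: list of (independent_1d_list, dependent_label) pairs; this is a
    hypothetical stand-in for source_table.
    Returns a list of (x_2d, y) pairs, one per packed batch.
    """
    if n_elements < 1:
        raise ValueError("n_elements must be >= 1")

    rows = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)  # note 1: random shuffle for mini-batch

    # Map each distinct label to a fixed position for one-hot encoding.
    classes = sorted({label for _, label in rows})
    position = {label: i for i, label in enumerate(classes)}

    def one_hot(label):
        vec = [0] * len(classes)
        vec[position[label]] = 1
        return vec

    packed = []
    for start in range(0, len(rows), n_elements):
        chunk = rows[start:start + n_elements]
        x_2d = [list(x) for x, _ in chunk]  # 1d arrays nested into a 2d array
        y = [one_hot(label) if encode else label for _, label in chunk]
        packed.append((x_2d, y))
    return packed


# Five rows packed two at a time give batches of 2, 2 and 1 rows.
rows = [([float(i), 2.0 * i], "a" if i % 2 else "b") for i in range(5)]
batches = pack_minibatches(rows, n_elements=2)
print([len(x_2d) for x_2d, _ in batches])
```

Because the shuffle and packing happen once and the result is kept, repeated optimizer runs (for example while tuning parameters) can reuse the packed batches, which is the point of the story above. The last batch may hold fewer than n_elements rows; how a real implementation handles that remainder is left open here.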
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)