[
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339688#comment-16339688
]
Nikhil commented on MADLIB-1200:
--------------------------------
The main purpose of the function is to prepare the training data for
minibatching algorithms. This will be achieved in two stages:
1. Based on the batch size, group the dependent and independent variables
of each batch into a single tuple that represents the batch.
2. If the encode parameter is True, perform one-hot encoding of the dependent
variable. Users will need to set encode to True for multiclass SVM/MLP and
False for single-class SVM.
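
The two stages can be illustrated with a minimal in-memory sketch in plain
Python. This is not the MADlib implementation: the names pack_minibatches and
one_hot, the seed argument, and the list-of-pairs input format are all
hypothetical, chosen only for illustration. The shuffle step reflects the
note in the issue description that random shuffling is needed for
mini-batching.

```python
import random


def one_hot(label, classes):
    """Return a 0/1 list with a single 1 at the position of label in classes."""
    return [1 if label == c else 0 for c in classes]


def pack_minibatches(rows, batch_size, encode=True, seed=None):
    """Pack (independent, dependent) rows into one tuple per batch.

    rows: list of (feature_list, label) pairs (hypothetical input format).
    Returns a list of (features_2d, dependents) tuples, one per batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Work on a copy so the caller's data is left untouched, then shuffle.
    rows = list(rows)
    random.Random(seed).shuffle(rows)

    # The distinct labels, in sorted order, define the one-hot positions.
    classes = sorted({label for _, label in rows})

    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Stage 1: group the rows of this batch into a single 2-D structure.
        features = [list(x) for x, _ in chunk]
        # Stage 2: optionally one-hot encode the dependent variable.
        if encode:
            dependents = [one_hot(y, classes) for _, y in chunk]
        else:
            dependents = [y for _, y in chunk]
        batches.append((features, dependents))
    return batches
```

With five input rows and batch_size=2, this sketch yields three batches of
sizes 2, 2 and 1; the last batch is simply smaller rather than padded, which
is a design choice of the sketch and not something the issue specifies.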
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1037
> https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
> data scientist
> {{I want to}}
> pre-process input files for use with mini-batching
> {{so that}}
> the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
> perhaps because I am tuning parameters (i.e., pre-processing is a 1-time
> operation)
> Interface
> This function is kind of the inverse of:
> array_unnest_2d_to_1d()
>
> [http://madlib.apache.org/docs/latest/array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70]
> but the difference is we want to persist an output table for the packed 2d
> array.
> Suggested interface:
> matrix_nest_1d_to_2d (
>     source_table,
>     output_table,
>     independent_varname,
>     dependent_varname,
>     n_elements,   -- number of elements to pack
>     encode        -- one-hot encoding, default True
> );
> where dependent_varname is a column of 1d arrays
> Or call it
> array_nest_1d_to_2d ()
> ?
> Notes
> 1) Random shuffle needed for mini-batch.
> 2) A naive approach may be OK to start; it is not worth a big investment to
> make it run 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
> 2) Some scale testing OK (does not need to be comprehensive)
> 3) Document as a helper function in the user docs
> 4) IC