[ 
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341379#comment-16341379
 ] 

Rahul Iyer commented on MADLIB-1200:
------------------------------------

The {{batch_size}} parameter in {{minibatch}} means something different from 
what is being implied here: what this issue calls {{batch_size}} is referred 
to as {{buffer_size}} in {{minibatch}}. Further, a default value for 
{{buffer_size}} is meaningful: leaving it unset implies that the user is OK 
with a value optimized for performance. 
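
A minimal sketch of the distinction, assuming {{buffer_size}} is the number of rows the preprocessor packs into one output tuple and {{batch_size}} is the number of rows the solver consumes per gradient step. The helper names {{pack_into_buffers}} and {{iter_minibatches}} are hypothetical and are not part of MADlib:

```python
import random


def pack_into_buffers(rows, buffer_size, seed=0):
    """Shuffle rows, then pack them into buffers of at most buffer_size rows.

    Hypothetical stand-in for the preprocessing step; not MADlib code.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + buffer_size]
            for i in range(0, len(shuffled), buffer_size)]


def iter_minibatches(buffer, batch_size):
    """Yield solver-side minibatches of at most batch_size rows from one buffer."""
    for i in range(0, len(buffer), batch_size):
        yield buffer[i:i + batch_size]


rows = list(range(10))
# Preprocessing decides how many rows are stored together ...
buffers = pack_into_buffers(rows, buffer_size=4)
# ... while the solver decides how many rows go into each update.
batches = [b for buf in buffers for b in iter_minibatches(buf, batch_size=2)]
```

Under these assumptions the two sizes can be tuned independently: {{buffer_size}} affects storage and scan performance, while {{batch_size}} affects the optimization itself.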

I'm not completely convinced that {{encode}} should be part of this function. 
If it is, the output should include the order of the encoding, so that the 
user can make sense of the output. Further, a downstream module (like MLP) 
can't output the actual categorical value in *{{predict}} unless the order of 
encoding is also given as input to that module. This would imply different 
input formats for the downstream module: encoded input for {{minibatch}} and 
un-encoded input for other solvers. 
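
To illustrate why the encoding order matters, here is a minimal sketch in which the encoder returns the class order alongside the encoded rows, so that a prediction can be mapped back to a categorical value. The functions {{one_hot_encode}} and {{decode}} are hypothetical, and sorting the classes is only one possible ordering:

```python
def one_hot_encode(labels):
    """Return (encoded_rows, classes); classes records the column order.

    Assumption: classes are ordered by sorting the distinct labels.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    encoded = [[1 if index[label] == i else 0 for i in range(len(classes))]
               for label in labels]
    return encoded, classes


def decode(scores, classes):
    """Map a vector of per-class scores back to the categorical value."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best]


encoded, classes = one_hot_encode(["cat", "dog", "bird", "dog"])
predicted = decode([0.1, 0.2, 0.7], classes)
```

Without {{classes}}, the downstream module would only know that column 2 scored highest, not which category that column stands for.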

> Pre-processing helper function for mini-batching 
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1037
>  https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
>  data scientist
>  {{I want to}}
>  pre-process input files for use with mini-batching
>  {{so that}}
>  the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, 
> perhaps because I am tuning parameters (i.e., pre-processing is an occasional 
> operation that I don't want to re-do every time that I train a model)
> Interface
> This function is kind of the inverse of:
>  
> Suggested interface:
> {code:java}
> minibatch_preprocessor (
> source_table, 
> output_table,
> dependent_varname,
> independent_varname,
> batch_size,                          -- Number of elements to pack
> encode                               -- One-hot encoding if set to TRUE
> ){code}
>  
> The main purpose of the function is to prepare the training data for 
> minibatching algorithms. This will be achieved in 2 stages:
> 1. Based on the batch size, group all the dependent and independent variables 
> into a single tuple representing the batch.
> 2. If the encode parameter is TRUE, perform one-hot encoding of the 
> dependent variable. Users will need to set encode to TRUE for multi-class 
> SVM/MLP and FALSE for single class SVM.
> Notes
> 1) Random shuffle needed for mini-batch.
>  2) Naive approach may be OK to start, not worth big investment to make run 
> 10% or 20% faster.
> Acceptance
> 1) Convert from standard to special format for mini-batching
>  2) Some scale testing OK (does not need to be comprehensive)
>  3) Document as a helper function in the user docs
>  4) IC



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
