[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil updated MADLIB-1200:
---------------------------
Description:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048
Story
{{As a}} data scientist
{{I want to}} pre-process input files for use with mini-batching
{{so that}} the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to redo every time I train a model).
Interface
{code:sql}
minibatch_preprocessor(
    source_table,         -- Name of the table containing the input data.
    output_table,         -- Name of the output table suitable for mini-batching.
    dependent_varname,    -- Name of the dependent variable column.
    independent_varname,  -- Expression list to evaluate for the independent variables.
    buffer_size           -- Buffer size. Default is to pack as much as possible
                          -- within the 1 GB limit imposed by PostgreSQL/Greenplum.
)
{code}
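For illustration, a call might look like the following. The table and column names ({{iris_data}}, {{class_text}}, the sepal/petal attributes) are hypothetical, and the function is assumed to be installed in the {{madlib}} schema.
{code:sql}
-- Hypothetical invocation; table and column names are illustrative only.
SELECT madlib.minibatch_preprocessor(
    'iris_data',            -- source_table
    'iris_data_packed',     -- output_table
    'class_text',           -- dependent_varname
    'ARRAY[sepal_length, sepal_width, petal_length, petal_width]',
                            -- independent_varname (expression list)
    NULL                    -- buffer_size: NULL is assumed to select the default
);
{code}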
The main purpose of the function is to prepare the training data for mini-batching algorithms. This is achieved in two stages:
# Based on the batch size, group all of the dependent and independent variables into a single tuple representative of the batch.
# If the independent variables are boolean or text, perform one-hot encoding (see the sketch below). This does not apply to integer and float columns; note that if integer variables are actually categorical, they must be cast to {{::TEXT}} so that they get encoded.
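As a rough illustration of stage 2, one-hot encoding a text column could be expressed as below. The column name {{color}} and its values are hypothetical; an integer code column would first need an explicit cast such as {{color_code::TEXT}}.
{code:sql}
-- Illustrative one-hot encoding of a hypothetical categorical column "color"
-- whose distinct values are 'red', 'green', 'blue'.
SELECT ARRAY[
           (color = 'red')::INTEGER,
           (color = 'green')::INTEGER,
           (color = 'blue')::INTEGER
       ] AS color_encoded
FROM source_table;
{code}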
Notes
1) A random shuffle of the rows is needed for mini-batching (see the sketch below).
2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.
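A rough sketch of stage 1 combined with the shuffle, in plain SQL rather than the eventual implementation. {{source_table}}, its column names, and the buffer size of 4 are placeholders.
{code:sql}
-- Rough sketch only: shuffle the rows, assign each row to a buffer of 4 rows,
-- then aggregate each buffer into a single tuple.
WITH shuffled AS (
    SELECT dependent_var,
           independent_vars,   -- assumed to be a DOUBLE PRECISION[] per row
           (row_number() OVER (ORDER BY random()) - 1) / 4 AS buffer_id
    FROM source_table
    WHERE dependent_var IS NOT NULL   -- NULLs in the dependent variable are ignored
)
SELECT buffer_id,
       array_agg(dependent_var)    AS dependent_varname,
       array_agg(independent_vars) AS independent_varname  -- 2-D array on PostgreSQL 9.5+
FROM shuffled
GROUP BY buffer_id;
{code}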
Acceptance
Summary
1) Convert from the standard format to the special format for mini-batching.
2) Standardize by default for now; the user cannot opt out of it (see the sketch below). We may decide to add a flag later.
3) Some scale testing is OK (does not need to be comprehensive).
4) Document as a helper function in the user docs.
5) Always ignore NULLs in the dependent variable.
6) IC
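For item 2, the default standardization is assumed to be z-scoring of each independent variable. A minimal per-column sketch, with the column {{x}} as a placeholder:
{code:sql}
-- Hedged sketch: z-score one numeric column x before packing; the real helper
-- would apply this element-wise to the independent variable array.
SELECT (x - avg(x) OVER ()) / NULLIF(stddev_pop(x) OVER (), 0) AS x_standardized
FROM source_table;
{code}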
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.14
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)