[
https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381233#comment-16381233
]
Frank McQuillan edited comment on MADLIB-1200 at 3/1/18 10:35 PM:
------------------------------------------------------------------
I added more details to the interface in the description section above.
Also updated the output table a bit to include:
{code}
dependent_vartype “Continuous” or “Categorical”
class_values Class values of the dependent variable (NULL
for non-categorical vars, i.e., if dependent_vartype=”Continuous”).
{code}
was (Author: fmcquillan):
I added more details to the interface in the description section above.
Also updated the output table a bit to include:
{code}
model type “Classification” or “Regression”
class_values Class values of the dependent variable (NULL
for non-categorical vars, i.e., if class_values=”Regression”).
{code}
> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
> Key: MADLIB-1200
> URL: https://issues.apache.org/jira/browse/MADLIB-1200
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Assignee: Jingyi Mei
> Priority: Major
> Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1037
> https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
> data scientist
> {{I want to}}
> pre-process input files for use with mini-batching
> {{so that}}
> the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
> perhaps because I am tuning parameters (i.e., pre-processing is an occasional
> operation that I don't want to re-do every time that I train a model)
> Interface
> {code}
> minibatch_preprocessor(
> source_table, -- Name of the table containing input data
> output_table, -- Name of the output table for mini-batching
> dependent_varname, -- Name of the dependent variable column
> independent_varname, -- Expression list to evaluate for the independent
> -- variables
> grouping_cols -- Preprocess separately by group
> )
> {code}
> where
> {code}
> source_table
> TEXT. Name of the table containing input data. Can also be a view.
> output_table
> TEXT. Name of the output table from the preprocessor which will be used as
> input to algorithms that support mini-batching.
> dependent_varname
> TEXT. Column name or expression to evaluate for the dependent variable.
> independent_varname
> TEXT. Column name or expression list to evaluate for the independent
> variable. Will be cast to double when packing.
> grouping_cols (optional)
> TEXT, default: NULL. An expression list used to group the input dataset into
> discrete groups, running one preprocessing step per group. Similar to the SQL
> GROUP BY clause. When this value is NULL, no grouping is used and a single
> preprocessing step is performed for the whole data set.
> The output table contains the following columns:
> id INTEGER. Unique id for packed table.
> dependent_varname FLOAT8[]. Packed array of dependent
> variables.
> independent_varname FLOAT8[]. Packed array of independent
> variables.
> grouping_cols TEXT. Name of grouping columns.
> A summary table named <output_table>_summary is created together with the
> output table. It has the following columns:
> source_table Source table name.
> output_table Output table name from preprocessor.
> dependent_varname Dependent variable.
> independent_varname Independent variables.
> buffer_size Buffer size used in preprocessing step.
> dependent_vartype “Continuous” or “Categorical”
> class_values Class values of the dependent variable (NULL
> for continuous vars).
> num_rows_processed The total number of rows that were used in the
> computation.
> num_missing_rows_skipped The total number of rows that were skipped
> because of NULL values in them.
> grouping_cols Names of the grouping columns.
> A standardization table named <output_table>_standardization is created
> together with the output table. It has the following columns:
> grouping_cols Group
> mean Mean of independent vars by group
> std Standard deviation of independent vars
> by group
> {code}
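For illustration, the mean and std columns of the standardization table could be computed along these lines. This is a minimal Python sketch, not MADlib code. The function name standardize, the use of the population standard deviation, and the substitution of 1.0 for a zero standard deviation are all assumptions that the issue does not specify. With grouping_cols set, the same computation would presumably run once per group.

```python
import math


def standardize(rows):
    """Return (standardized_rows, mean, std) for equal-length numeric rows.

    Assumptions (not from the issue): population standard deviation, and a
    zero standard deviation is replaced by 1.0 so that constant columns do
    not cause a division by zero.
    """
    n = len(rows)
    width = len(rows[0])
    # Per-column mean over all rows.
    mean = [sum(r[j] for r in rows) / n for j in range(width)]
    std = []
    for j in range(width):
        variance = sum((r[j] - mean[j]) ** 2 for r in rows) / n
        s = math.sqrt(variance)
        std.append(s if s > 0 else 1.0)
    # Center and scale every value with its column's mean and std.
    scaled = [[(r[j] - mean[j]) / std[j] for j in range(width)] for r in rows]
    return scaled, mean, std
```

The returned mean and std lists play the role of the mean and std columns, so a later training run could apply the same scaling to new data.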
>
> The main purpose of the function is to prepare the training data for
> mini-batching algorithms. This will be achieved in two stages:
> # Based on the batch size, group all the dependent and independent variables
> into a single tuple representing the batch.
> # If the independent variables are boolean or text, perform one-hot
> encoding. This does not apply to integers and floats. Note that if the integer
> vars are actually categorical, they must be cast with ::TEXT so that they get
> encoded.
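The two stages above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MADlib implementation: the helper names one_hot_encode and pack_batches are hypothetical, batch ids start at 0, classes are ordered by their string form, and the sketch encodes before packing purely for simplicity.

```python
def one_hot_encode(values):
    """Encode a column of text or boolean values as 0.0/1.0 indicator lists.

    Returns (encoded_rows, class_values). Classes are sorted by their string
    form so that the indicator column order is deterministic (an assumption).
    """
    classes = sorted(set(values), key=str)
    index = {c: i for i, c in enumerate(classes)}
    encoded = []
    for v in values:
        row = [0.0] * len(classes)
        row[index[v]] = 1.0
        encoded.append(row)
    return encoded, classes


def pack_batches(dependent, independent, buffer_size):
    """Group consecutive rows into batches of at most buffer_size rows.

    Each batch is (id, packed dependent values, packed independent rows),
    loosely mirroring the id, dependent_varname and independent_varname
    columns of the output table.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    batches = []
    for start in range(0, len(dependent), buffer_size):
        stop = start + buffer_size
        batches.append((len(batches), dependent[start:stop], independent[start:stop]))
    return batches
```

For example, encoding the text column ["red", "blue", "red"] and packing it with a buffer size of 2 gives two batches, the second holding the single leftover row.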
> Notes
> 1) Random shuffle needed for mini-batch.
> 2) Naive approach may be OK to start, not worth big investment to make run
> 10% or 20% faster.
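A naive version of the random shuffle in note 1 might look like the Python sketch below. The function name shuffle_rows and the optional seed parameter are assumptions made for illustration. The only point is that dependent and independent rows must be permuted together before they are packed into batches.

```python
import random


def shuffle_rows(dependent, independent, seed=None):
    """Return the dependent and independent rows in one shared random order.

    A seed (hypothetical parameter) makes the permutation reproducible.
    """
    order = list(range(len(dependent)))
    random.Random(seed).shuffle(order)
    return [dependent[i] for i in order], [independent[i] for i in order]
```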
> Acceptance
> Summary
> 1) Convert from standard to special format for mini-batching
> 2) Always standardize for now; the user cannot opt out of it. We may
> decide to add a flag later.
> 3) Some scale testing OK (does not need to be comprehensive)
> 4) Document as a helper function in the user docs
> 5) Always ignore nulls in dependent variable
> 6) IC
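Acceptance item 5 and the two row counters in the summary table could be sketched in Python as below. This is an assumption-laden illustration: the function name drop_null_dependent is hypothetical, and only NULLs (None) in the dependent variable are handled, since the issue does not spell out how NULLs in independent variables are treated.

```python
def drop_null_dependent(dependent, independent):
    """Drop rows whose dependent value is None and report the row counts.

    The summary keys reuse the column names num_rows_processed and
    num_missing_rows_skipped from the summary table description.
    """
    kept_dep, kept_ind = [], []
    for y, x in zip(dependent, independent):
        if y is None:
            continue  # always ignore NULLs in the dependent variable
        kept_dep.append(y)
        kept_ind.append(x)
    summary = {
        "num_rows_processed": len(kept_dep),
        "num_missing_rows_skipped": len(dependent) - len(kept_dep),
    }
    return kept_dep, kept_ind, summary
```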