[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1200:
------------------------------------
    Description:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
data scientist
{{I want to}}
pre-process input files for use with mini-batching
{{so that}}
the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time that I train a model)

Interface
{code}
minibatch_preprocessor(
    source_table,          -- Name of the table containing input data
    output_table,          -- Name of the output table for mini-batching
    dependent_varname,     -- Name of the dependent variable column
    independent_varname,   -- Expression list to evaluate for the independent variables
    grouping_cols          -- Preprocess separately by group
)
{code}
where
{code}
source_table
  TEXT. Name of the table containing input data. Can also be a view.

output_table
  TEXT. Name of the output table from the preprocessor which will be used as input to algorithms that support mini-batching.

dependent_varname
  TEXT. Column name or expression to evaluate for the dependent variable.

independent_varname
  TEXT. Column name or expression list to evaluate for the independent variable. Will be cast to double when packing.

grouping_cols (optional)
  TEXT, default: NULL. An expression list used to group the input dataset into discrete groups, running one preprocessing step per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single preprocessing step is performed for the whole data set.

The output table contains the following columns:

id                          INTEGER. Unique id for packed table.
dependent_varname           FLOAT8[]. Packed array of dependent variables.
independent_varname         FLOAT8[]. Packed array of independent variables.
grouping_cols               TEXT. Name of grouping columns.

A summary table named <output_table>_summary is created together with the output table. It has the following columns:

source_table                Source table name.
output_table                Output table name from preprocessor.
dependent_varname           Dependent variable.
independent_varname         Independent variables.
buffer_size                 Buffer size used in preprocessing step.
model type                  "Classification" or "Regression"
class_values                Class values of the dependent variable (NULL for non-categorical vars, i.e., if model type is "Regression").
num_rows_processed          The total number of rows that were used in the computation.
num_missing_rows_skipped    The total number of rows that were skipped because of NULL values in them.
grouping_cols               Names of the grouping columns.

A standardization table named <output_table>_standardization is created together with the output table. It has the following columns:

grouping_cols               Group
mean                        Mean of independent vars by group
std                         Standard deviation of independent vars by group
{code}

The main purpose of the function is to prepare the training data for mini-batching algorithms. This will be achieved in 2 stages:
# Based on the batch size, group all the dependent and independent variables into a single tuple that represents the batch.
# If the independent variables are boolean or text, perform one-hot encoding. This does not apply to integers and floats. Note that if the integer vars are actually categorical, they must be cast to ::TEXT so that they get encoded.

Notes
1) Random shuffle needed for mini-batch.
2) Naive approach may be OK to start, not worth big investment to make run 10% or 20% faster.

Acceptance

Summary
1) Convert from standard to special format for mini-batching
2) Standardize by default. For now the user cannot opt out of it; we may decide to add a flag later.
3) Some scale testing OK (does not need to be comprehensive)
4) Document as a helper function in the user docs
5) Always ignore nulls in dependent variable
6) IC

  was:
Related to
https://issues.apache.org/jira/browse/MADLIB-1037
https://issues.apache.org/jira/browse/MADLIB-1048

Story

{{As a}}
data scientist
{{I want to}}
pre-process input files for use with mini-batching
{{so that}}
the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time that I train a model)

Interface
{code}
minibatch_preprocessor(
    source_table,          -- Name of the table containing input data
    output_table,          -- Name of the output table for mini-batching
    dependent_varname,     -- Name of the dependent variable column
    independent_varname,   -- Expression list to evaluate for the independent variables
    grouping_cols          -- Preprocess separately by group
)
{code}
where
{code}
source_table
  TEXT. Name of the table containing input data. Can also be a view.

output_table
  TEXT. Name of the output table from the preprocessor which will be used as input to algorithms that support mini-batching.

The output table contains the following columns:

id                          INTEGER. Unique id for packed table.
dependent_varname           FLOAT8[]. Packed array of dependent variables.
independent_varname         FLOAT8[]. Packed array of independent variables.
grouping_cols               TEXT. Name of grouping columns.

A summary table named <output_table>_summary is created together with the output table. It has the following columns:

source_table                Source table name.
output_table                Output table name from preprocessor.
dependent_varname           Dependent variable.
independent_varname         Independent variables.
buffer_size                 Buffer size used in preprocessing step.
model type                  "Classification" or "Regression"
class_values                Class values of the dependent variable (NULL for non-categorical vars, i.e., if model type is "Regression").
num_rows_processed          The total number of rows that were used in the computation.
num_missing_rows_skipped    The total number of rows that were skipped because of NULL values in them.
grouping_cols               Names of the grouping columns.

A standardization table named <output_table>_standardization is created together with the output table. It has the following columns:

grouping_cols               Group
mean                        Mean of independent vars by group
std                         Standard deviation of independent vars by group

dependent_varname
  TEXT. Column name or expression to evaluate for the dependent variable.

independent_varname
  TEXT. Column name or expression list to evaluate for the independent variable. Will be cast to double when packing.

grouping_cols (optional)
  TEXT, default: NULL. An expression list used to group the input dataset into discrete groups, running one preprocessing step per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single preprocessing step is performed for the whole data set.
{code}

The main purpose of the function is to prepare the training data for mini-batching algorithms. This will be achieved in 2 stages:
# Based on the batch size, group all the dependent and independent variables into a single tuple that represents the batch.
# If the independent variables are boolean or text, perform one-hot encoding. This does not apply to integers and floats. Note that if the integer vars are actually categorical, they must be cast to ::TEXT so that they get encoded.

Notes
1) Random shuffle needed for mini-batch.
2) Naive approach may be OK to start, not worth big investment to make run 10% or 20% faster.

Acceptance

Summary
1) Convert from standard to special format for mini-batching
2) Standardize by default. For now the user cannot opt out of it; we may decide to add a flag later.
3) Some scale testing OK (does not need to be comprehensive)
4) Document as a helper function in the user docs
5) Always ignore nulls in dependent variable
6) IC

> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
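
The preprocessing described in this issue (random shuffle, standardization, packing rows into batches, and one-hot encoding of categorical values) can be illustrated with a small sketch in plain Python. This is not the MADlib implementation: the helper names `one_hot`, `standardize`, `pack` and `preprocess` are hypothetical, the use of the population standard deviation is an assumption, and skipping any row that contains a NULL is one reading of the `num_missing_rows_skipped` description.

```python
import random
from statistics import mean, pstdev


def one_hot(values):
    """Encode categorical values (e.g. text or boolean) as 0/1 indicator vectors."""
    classes = sorted(set(values))
    return classes, [[1.0 if v == c else 0.0 for c in classes] for v in values]


def standardize(rows):
    """Column-wise (x - mean) / std. A zero std is replaced by 1.0 (assumption)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]
    scaled = [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def pack(dep, indep, buffer_size, seed=None):
    """Shuffle the rows, then pack every buffer_size rows into one tuple:
    (id, packed dependent values, packed independent rows)."""
    order = list(range(len(dep)))
    random.Random(seed).shuffle(order)
    packed = []
    for batch_id, start in enumerate(range(0, len(order), buffer_size)):
        idx = order[start:start + buffer_size]
        packed.append((batch_id, [dep[i] for i in idx], [indep[i] for i in idx]))
    return packed


def preprocess(rows, buffer_size, seed=None):
    """rows is a list of (dependent, [independent, ...]) pairs.
    Rows containing None (standing in for SQL NULL) are skipped."""
    kept = [(y, x) for y, x in rows if y is not None and None not in x]
    dep = [float(y) for y, _ in kept]
    indep, means, stds = standardize([[float(v) for v in x] for _, x in kept])
    summary = {
        "buffer_size": buffer_size,
        "num_rows_processed": len(kept),
        "num_missing_rows_skipped": len(rows) - len(kept),
        "mean": means,
        "std": stds,
    }
    return pack(dep, indep, buffer_size, seed), summary
```

Calling `preprocess(rows, buffer_size=3)` on a small list of rows returns the packed batches together with a dictionary that loosely mirrors the summary and standardization tables described in the issue. Grouping by `grouping_cols` is omitted here; under the same assumptions it would amount to running `preprocess` once per group.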