[ https://issues.apache.org/jira/browse/MADLIB-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan resolved MADLIB-1220.
-------------------------------------
    Resolution: Fixed

> Pre-processing helper function for mini-batching - grouping
> ------------------------------------------------------------
>
>                 Key: MADLIB-1220
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1220
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Nikhil
>            Assignee: Nikhil
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1200
> Story
> {{As a}}
> data scientist
> {{I want to}}
> add grouping to mini-batch pre-process
> {{so that}}
> I can handle groups with a single operation.
> Interface
> {code:java}
> minibatch_preprocessor(
>     source_table,         -- Name of the table containing input data
>     output_table,         -- Name of the output table for mini-batching
>     dependent_varname,    -- Name of the dependent variable column
>     independent_varname,  -- Expression list to evaluate for the independent variables
>     grouping_cols,        -- Preprocess separately by group
>     buffer_size           -- Number of source input rows to pack into batch
> )
> {code}
> where
> {code:java}
> source_table
>     TEXT. Name of the table containing input data. Can also be a view.
> output_table
>     TEXT. Name of the output table from the preprocessor which will be used as
>     input to algorithms that support mini-batching.
> dependent_varname
>     TEXT. Column name or expression to evaluate for the dependent variable.
> independent_varname
>     TEXT. Column name or expression list to evaluate for the independent
>     variable. Will be cast to double when packing.
> grouping_cols (optional)
>     TEXT, default: NULL. An expression list used to group the input dataset into
>     discrete groups, running one preprocessing step per group. Similar to the SQL
>     GROUP BY clause. When this value is NULL, no grouping is used and a single
>     preprocessing step is performed for the whole data set.
> buffer_size (optional)
>     INTEGER, default: ???. Number of source input rows to pack into batch.
> {code}
> The output table contains the following columns:
> {code:java}
> id                    INTEGER. Unique id for packed table.
> dependent_varname     FLOAT8[]. Packed array of dependent variables.
> independent_varname   FLOAT8[]. Packed array of independent variables.
> grouping_cols         TEXT. Name of grouping columns.
> {code}
> A summary table named <output_table>_summary is created together with the
> output table. It has the following columns:
> {code:java}
> source_table              Source table name.
> output_table              Output table name from preprocessor.
> dependent_varname         Dependent variable.
> independent_varname       Independent variables.
> buffer_size               Buffer size used in preprocessing step.
> dependent_vartype         “Continuous” or “Categorical”
> class_values              Class values of the dependent variable (NULL for
>                           continuous vars).
> num_rows_processed        The total number of rows that were used in the
>                           computation.
> num_missing_rows_skipped  The total number of rows that were skipped because of
>                           NULL values in them.
> grouping_cols             Names of the grouping columns.
> {code}
> A standardization table named <output_table>_standardization is created
> together with the output table. It has the following columns:
> {code:java}
> <grouping_col_expression>  Group -- this column doesn't exist if no grouping
>                            column was specified
> mean                       Mean of independent vars by group
> std                        Standard deviation of independent vars by group
> {code}
>
> Acceptance

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
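
The grouping and packing behaviour described under Interface can be illustrated with a small in-memory sketch. This is not MADlib code: `minibatch_preprocess` and its argument names are hypothetical, and several details are assumptions the ticket does not state, namely that the packed independent values are standardized with the per-group mean and std, that std is the population standard deviation with 0 replaced by 1, and that ids are unique across groups. `buffer_size` is a required argument here because the ticket leaves its default unspecified.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def minibatch_preprocess(rows, dependent, independent, buffer_size, grouping=None):
    """Hypothetical in-memory analogue of the ticket's interface.

    rows: list of dicts, dependent: column name, independent: list of
    column names, grouping: column name or None (no grouping).
    Returns (packed_rows, standardization_by_group).
    """
    groups = defaultdict(list)
    for row in rows:
        values = [row[dependent]] + [row[c] for c in independent]
        if any(v is None for v in values):
            continue  # rows containing NULLs are skipped
        key = row[grouping] if grouping is not None else None
        groups[key].append(row)

    packed, standardization = [], {}
    next_id = 0
    for key, members in groups.items():
        columns = [[float(r[c]) for r in members] for c in independent]
        mean = [fmean(col) for col in columns]
        # Assumption: population std, with 0 replaced by 1 to avoid dividing by zero.
        std = [pstdev(col) or 1.0 for col in columns]
        standardization[key] = {"mean": mean, "std": std}

        # Pack up to buffer_size source rows of this group into one output row.
        for start in range(0, len(members), buffer_size):
            chunk = members[start:start + buffer_size]
            packed.append({
                "id": next_id,
                "grouping": key,
                "dependent": [float(r[dependent]) for r in chunk],
                "independent": [
                    [(float(r[c]) - m) / s for c, m, s in zip(independent, mean, std)]
                    for r in chunk
                ],
            })
            next_id += 1
    return packed, standardization
```

With `grouping=None` every non-NULL row lands in a single group, which mirrors the ticket's statement that a single preprocessing step is performed for the whole data set.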