[
https://issues.apache.org/jira/browse/MADLIB-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437536#comment-16437536
]
Frank McQuillan commented on MADLIB-1220:
-----------------------------------------
Done in
https://github.com/apache/madlib/pull/254
https://github.com/apache/madlib/pull/256
> Pre-processing helper function for mini-batching - grouping
> ------------------------------------------------------------
>
> Key: MADLIB-1220
> URL: https://issues.apache.org/jira/browse/MADLIB-1220
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Nikhil
> Assignee: Nikhil
> Priority: Major
> Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1200
> Story
> {{As a}}
> data scientist
> {{I want to}}
> add grouping to mini-batch pre-process
> {{so that}}
> I can handle groups with a single operation.
> Interface
> {code:java}
> minibatch_preprocessor(
> source_table, -- Name of the table containing input data
> output_table, -- Name of the output table for mini-batching
> dependent_varname, -- Name of the dependent variable column
> independent_varname, -- Expression list to evaluate for the independent variables
> grouping_cols, -- Preprocess separately by group
> buffer_size -- Number of source input rows to pack into each batch
> )
> {code}
> where
> {code:java}
> source_table
> TEXT. Name of the table containing input data. Can also be a view.
> output_table
> TEXT. Name of the output table from the preprocessor which will be used as
> input to algorithms that support mini-batching.
> dependent_varname
> TEXT. Column name or expression to evaluate for the dependent variable.
> independent_varname
> TEXT. Column name or expression list to evaluate for the independent
> variable. Will be cast to double when packing.
> grouping_cols (optional)
> TEXT, default: NULL. An expression list used to group the input dataset into
> discrete groups, running one preprocessing step per group. Similar to the SQL
> GROUP BY clause. When this value is NULL, no grouping is used and a single
> preprocessing step is performed for the whole data set.
> buffer_size (optional)
> INTEGER, default: ???. Number of source input rows to pack into each batch.
> {code}
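The grouping and packing behaviour described above can be illustrated with a short Python sketch. This is not the MADlib implementation, only one reading of the spec under stated assumptions: `pack_minibatches` and the row layout (dicts with `x`, `y`, and an optional group key) are hypothetical, rows containing NULL (`None`) values are skipped, and ids are assumed to be unique across all groups.

```python
from collections import defaultdict


def pack_minibatches(rows, buffer_size, group_key=None):
    """Pack rows into batches of at most buffer_size rows, one group at a time.

    rows: iterable of dicts with "x" (list of numbers), "y" (number) and,
    when group_key is given, a value stored under that key.
    Hypothetical helper for illustration; not part of MADlib.
    """
    groups = defaultdict(list)
    for row in rows:
        # Rows with missing values are skipped (assumption based on the
        # num_missing_rows_skipped column of the summary table).
        if row["y"] is None or any(v is None for v in row["x"]):
            continue
        # With no grouping, every row falls into the single group None.
        key = row[group_key] if group_key is not None else None
        groups[key].append(row)

    packed = []
    for key, members in groups.items():
        # Slice each group into consecutive chunks of buffer_size rows;
        # the last chunk of a group may be smaller.
        for start in range(0, len(members), buffer_size):
            chunk = members[start:start + buffer_size]
            packed.append({
                "id": len(packed),  # assumed unique across groups
                "group": key,
                "dependent_varname": [float(r["y"]) for r in chunk],
                # Independent variables are cast to double when packing.
                "independent_varname": [[float(v) for v in r["x"]] for r in chunk],
            })
    return packed
```

Calling `pack_minibatches(rows, 2, "g")` on a table with three valid rows in one group and one in another would yield three packed rows, since batches never span groups in this sketch.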
> The output table contains the following columns:
> {code:java}
> id                   INTEGER. Unique id for packed table.
> dependent_varname    FLOAT8[]. Packed array of dependent variables.
> independent_varname  FLOAT8[]. Packed array of independent variables.
> grouping_cols        TEXT. Name of grouping columns.
> {code}
> A summary table named <output_table>_summary is created together with the
> output table. It has the following columns:
> {code:java}
> source_table              Source table name.
> output_table              Output table name from preprocessor.
> dependent_varname         Dependent variable.
> independent_varname       Independent variables.
> buffer_size               Buffer size used in preprocessing step.
> dependent_vartype         “Continuous” or “Categorical”.
> class_values              Class values of the dependent variable (NULL for continuous vars).
> num_rows_processed        The total number of rows that were used in the computation.
> num_missing_rows_skipped  The total number of rows that were skipped because of NULL values in them.
> grouping_cols             Names of the grouping columns.
> {code}
> A standardization table named <output_table>_standardization is created
> together with the output table. It has the following columns:
> {code:java}
> <grouping_col_expression>  Group -- this column doesn't exist if no grouping column was specified
> mean                       Mean of independent vars by group
> std                        Standard deviation of independent vars by group
> {code}
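As a rough illustration of what the standardization table holds, the Python sketch below computes a per-group mean and standard deviation for each independent variable. It is not taken from MADlib: `standardization_table` and the row layout are hypothetical, and the population standard deviation is an assumption, since the description does not say whether the population or sample form is used.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def standardization_table(rows, group_key=None):
    """Return one {"group", "mean", "std"} entry per group.

    rows: iterable of dicts with "x" (list of numbers) and, when group_key
    is given, a value stored under that key.
    Hypothetical helper for illustration; not part of MADlib.
    """
    groups = defaultdict(list)
    for row in rows:
        # With no grouping, all rows share the single group None.
        key = row[group_key] if group_key is not None else None
        groups[key].append(row["x"])

    table = []
    for key, vectors in groups.items():
        # Transpose so that each tuple holds one independent variable.
        columns = list(zip(*vectors))
        table.append({
            "group": key,
            "mean": [fmean(c) for c in columns],
            # Population standard deviation (assumption, see lead-in).
            "std": [pstdev(c) for c in columns],
        })
    return table
```

When `group_key` is omitted, the result has a single entry covering the whole data set, mirroring the case where the group column is absent from the table.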
>
> Acceptance
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)