[ 
https://issues.apache.org/jira/browse/MADLIB-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437536#comment-16437536
 ] 

Frank McQuillan commented on MADLIB-1220:
-----------------------------------------

Done in

https://github.com/apache/madlib/pull/254
https://github.com/apache/madlib/pull/256

> Pre-processing helper function for mini-batching - grouping 
> ------------------------------------------------------------
>
>                 Key: MADLIB-1220
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1220
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Nikhil
>            Assignee: Nikhil
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1200
> Story
> {{As a}}
>  data scientist
>  {{I want to}}
>  add grouping to mini-batch pre-process
>  {{so that}}
>  I can handle groups with a single operation.
> Interface
> {code:java}
> minibatch_preprocessor(
>     source_table,        -- Name of the table containing input data
>     output_table,        -- Name of the output table for mini-batching
>     dependent_varname,   -- Name of the dependent variable column
>     independent_varname, -- Expression list to evaluate for the independent variables
>     grouping_cols,       -- Preprocess separately by group
>     buffer_size          -- Number of source input rows to pack into batch
> )
> {code}
> where
> {code:java}
> source_table
> TEXT.  Name of the table containing input data.  Can also be a view.
> output_table
> TEXT.  Name of the output table from the preprocessor which will be used as 
> input to algorithms that support mini-batching.
> dependent_varname
> TEXT.  Column name or expression to evaluate for the dependent variable. 
> independent_varname
> TEXT.  Column name or expression list to evaluate for the independent 
> variable.  Will be cast to double when packing.
> grouping_cols (optional)
> TEXT, default: NULL.  An expression list used to group the input dataset into 
> discrete groups, running one preprocessing step per group. Similar to the SQL 
> GROUP BY clause. When this value is NULL, no grouping is used and a single 
> preprocessing step is performed for the whole data set.
> buffer_size (optional)
> INTEGER, default: ???.  Number of source input rows to pack into batch.
> {code}
> The output table contains the following columns:
> {code:java}
> id                     INTEGER.   Unique id for packed table.
> dependent_varname      FLOAT8[].  Packed array of dependent variables.
> independent_varname    FLOAT8[].  Packed array of independent variables.
> grouping_cols          TEXT.      Name of grouping columns.
> {code}
> A summary table named <output_table>_summary is created together with the 
> output table. It has the following columns:
> {code:java}
> source_table                Source table name.
> output_table                Output table name from preprocessor.
> dependent_varname           Dependent variable.
> independent_varname         Independent variables.
> buffer_size                 Buffer size used in preprocessing step.
> dependent_vartype           “Continuous” or “Categorical”
> class_values                Class values of the dependent variable (NULL for continuous vars).
> num_rows_processed          The total number of rows that were used in the computation.
> num_missing_rows_skipped    The total number of rows that were skipped because of NULL values in them.
> grouping_cols               Names of the grouping columns.
> {code}
> A standardization table named <output_table>_standardization is created 
> together with the output table. It has the following columns:
> {code:java}
> <grouping_col_expression>   Group -- this column doesn't exist if no grouping column was specified
> mean                        Mean of independent vars by group
> std                         Standard deviation of independent vars by group
> {code}
>  
>  Acceptance
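
For readers skimming the quoted issue, here is a rough Python sketch of the grouping behaviour it describes: per-group mean and std of the independent variables, rows with NULLs skipped, and up to buffer_size rows packed into each output row. It is an in-memory illustration only, not MADlib code. The function name minibatch_preprocess_sketch, the dict-based row layout, the use of population standard deviation, the zero-std fallback to 1.0, and the idea that the packed independent variables are standardized with the per-group statistics are all assumptions made for the example.

```python
import math
from collections import defaultdict


def minibatch_preprocess_sketch(rows, buffer_size, grouping_col=None):
    """Hypothetical in-memory analogue of the preprocessor; not MADlib code.

    rows: list of dicts with 'y' (number), 'x' (list of numbers) and,
    optionally, a grouping key. Returns (packed, standardization, summary).
    """
    groups = defaultdict(list)
    skipped = 0
    for row in rows:
        x, y = row["x"], row["y"]
        # Assumption: a row containing any NULL (None) value is skipped.
        if y is None or x is None or any(v is None for v in x):
            skipped += 1
            continue
        key = row[grouping_col] if grouping_col else None
        groups[key].append((float(y), [float(v) for v in x]))

    packed, standardization = [], {}
    batch_id = 0
    for key, members in groups.items():
        n = len(members)
        width = len(members[0][1])
        mean = [sum(x[j] for _, x in members) / n for j in range(width)]
        # Assumption: population standard deviation; a zero std is replaced
        # by 1.0 so that constant columns do not cause division by zero.
        std = [
            math.sqrt(sum((x[j] - mean[j]) ** 2 for _, x in members) / n) or 1.0
            for j in range(width)
        ]
        standardization[key] = {"mean": mean, "std": std}
        # Pack up to buffer_size rows of this group into each output row.
        for start in range(0, n, buffer_size):
            chunk = members[start:start + buffer_size]
            packed.append({
                "id": batch_id,
                "group": key,
                "dependent_varname": [y for y, _ in chunk],
                "independent_varname": [
                    [(x[j] - mean[j]) / std[j] for j in range(width)]
                    for _, x in chunk
                ],
            })
            batch_id += 1

    summary = {
        "buffer_size": buffer_size,
        "num_rows_processed": sum(len(m) for m in groups.values()),
        "num_missing_rows_skipped": skipped,
        "grouping_cols": grouping_col,
    }
    return packed, standardization, summary


# Made-up sample data: two groups, one row with a NULL dependent value.
rows = [
    {"g": "a", "y": 1, "x": [1.0, 10.0]},
    {"g": "a", "y": 0, "x": [3.0, 10.0]},
    {"g": "a", "y": 1, "x": [5.0, 10.0]},
    {"g": "b", "y": 0, "x": [2.0, 4.0]},
    {"g": "b", "y": None, "x": [9.0, 9.0]},
]
packed, standardization, summary = minibatch_preprocess_sketch(rows, 2, "g")
# 3 packed rows: two for group "a" (2 rows + 1 row), one for group "b".
```

Passing grouping_col=None corresponds to the NULL default in the issue: all rows fall into one group and a single set of statistics is computed for the whole data set.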



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
