GitHub user kaknikhil opened a pull request: https://github.com/apache/madlib/pull/241
MiniBatch Pre-Processor: Add new module minibatch_preprocessing

JIRA: MADLIB-1200

The MiniBatch Preprocessor is a utility function that pre-processes input data for use with models that support mini-batching as an optimization. Its main purpose is to prepare the training data for mini-batching algorithms:

1. If the dependent variable is boolean or text, one-hot encode it (not applicable to numeric dependent variables).
2. Typecast the independent variable to double precision[].
3. Based on the buffer size, group the dependent and independent variables into a single tuple representing the buffer.

Notes:
1. Null values in the independent and dependent variables are ignored.
2. The input is standardized before packing.

Other changes:
1. Removed the leading __ from public methods in utils_regularization.py. Renamed __utils_ind_var_scales and __utils_ind_var_scales_grouping so that they can be accessed from within a class, specifically the minibatch_preprocessing module.
2. Added a new function for regex matching and refactored elastic_net.py_in to use it.

Co-authored-by: Rahul Iyer <ri...@apache.org>
Co-authored-by: Jingyi Mei <j...@pivotal.io>
Co-authored-by: Nandish Jayaram <njaya...@apache.org>
Co-authored-by: Orhan Kislal <okis...@pivotal.io>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/minibatch_preprocessing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/241.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #241

----

commit 7e89d4097d1d889adfa2eff3ed6217c75b519427
Author: Nikhil Kak <nkak@...>
Date: 2018-01-24T20:01:40Z

    MiniBatch Pre-Processor: Add new module minibatch_preprocessing

----
---
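The preprocessing steps described above (one-hot encoding, typecasting to double precision, standardizing, and packing rows into buffers) can be sketched in plain Python/NumPy. This is an illustrative sketch only, not the MADlib implementation: the function names `one_hot_encode` and `preprocess` are hypothetical, and null-value handling is omitted for brevity.

```python
import numpy as np

def one_hot_encode(dep_values):
    """One-hot encode a boolean/text dependent variable (step 1).

    Hypothetical helper; MADlib's actual encoding lives in SQL/Python
    layers not shown here.
    """
    classes = sorted(set(dep_values))
    index = {c: i for i, c in enumerate(classes)}
    encoded = np.zeros((len(dep_values), len(classes)))
    for row, value in enumerate(dep_values):
        encoded[row, index[value]] = 1.0
    return encoded

def preprocess(indep, dep, buffer_size):
    """Typecast and standardize the independent variables, one-hot encode
    the dependent variable, and pack consecutive rows into buffers of
    `buffer_size` rows (steps 2-3 plus the standardization note)."""
    x = np.asarray(indep, dtype=np.float64)   # typecast to double precision[]
    x = (x - x.mean(axis=0)) / x.std(axis=0)  # standardize before packing
    y = one_hot_encode(dep)
    # Group consecutive rows into a single tuple per buffer.
    return [(x[i:i + buffer_size], y[i:i + buffer_size])
            for i in range(0, len(x), buffer_size)]

buffers = preprocess([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                     ["a", "b", "a", "b"], buffer_size=2)
print(len(buffers))         # 2 buffers of 2 rows each
print(buffers[0][0].shape)  # (2, 2)
```

Packing many rows into one tuple per buffer is what lets a mini-batching algorithm fetch a whole batch with a single row read instead of one read per training example.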