GitHub user freeman-lab opened a pull request: https://github.com/apache/spark/pull/1361
Streaming mllib

This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with @tdas and @mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingRegression_
- An abstract class for fitting regression analyses online on streaming data, including training on (and updating) a model, and making predictions

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_MLStreamingUtils_
- Utility for loading and parsing streaming data from a text file stream, could be extended with functions for loading data from Kafka, Network, etc.

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- I will definitely add tests but I wasn't sure where it makes sense to put them: mllib or streaming?
- If this looks good, I can use the StreamingRegression class to do all other regression analyses (Ridge, Lasso, etc.), and a similar StreamingClassification class would give us logistic and SVM classification.
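To make the idea of training a linear model online concrete, here is a minimal sketch in plain Python. It is not the code in this PR (which is Scala on top of Spark Streaming and MLlib). The class name `StreamingLinearRegressionSketch`, its methods, and the parameters `step_size` and `num_iterations` are hypothetical names chosen for illustration. The sketch assumes the stream arrives as a sequence of mini-batches of `(label, features)` pairs, and that each batch refines the weights left over from the previous batch with a few gradient descent steps on squared error.

```python
from typing import Iterable, List, Tuple

# One labeled example: (label, features). This layout is an assumption
# made for the sketch, not the PR's data format.
Example = Tuple[float, List[float]]


class StreamingLinearRegressionSketch:
    """Hypothetical stand-in for an online linear regression model."""

    def __init__(self, num_features: int, step_size: float = 0.1,
                 num_iterations: int = 50) -> None:
        self.weights = [0.0] * num_features  # start from all-zero weights
        self.step_size = step_size
        self.num_iterations = num_iterations

    def predict(self, features: List[float]) -> float:
        # Linear prediction without an intercept: dot(weights, features).
        return sum(w * x for w, x in zip(self.weights, features))

    def train_on_batch(self, batch: List[Example]) -> None:
        # Refine the current weights on one mini-batch, starting from the
        # weights learned on earlier batches rather than from scratch.
        if not batch:
            return  # an empty batch leaves the model unchanged
        for _ in range(self.num_iterations):
            gradient = [0.0] * len(self.weights)
            for label, features in batch:
                error = self.predict(features) - label
                for j, x in enumerate(features):
                    gradient[j] += error * x
            # Step along the negative mean gradient of the squared error.
            self.weights = [
                w - self.step_size * g / len(batch)
                for w, g in zip(self.weights, gradient)
            ]

    def train_on(self, stream: Iterable[List[Example]]) -> None:
        # Consume batches in arrival order, updating the model after each.
        for batch in stream:
            self.train_on_batch(batch)


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    # Noise-free labels from y = 2*x1 - 3*x2, for demonstration only.
    batch = [(2.0 * p[0] - 3.0 * p[1], p) for p in points]
    model = StreamingLinearRegressionSketch(num_features=2, step_size=0.5)
    model.train_on([batch, batch])
    print(model.weights)
```

Carrying the weights from one batch to the next is what makes the training online in this sketch: the model is usable for prediction at any point between batches, which mirrors the "train on one stream, predict on another" example use case described above.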
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/freeman-lab/spark streaming-mllib

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1361

----
commit 0898add2e1dd2f1faac9e8d08c758994af03ee6e
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:39:31Z

    Added dependency on streaming

commit d99aa85d8f275ca605aacb2804f0c55fff10ff2b
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:40:55Z

    Helper methods for streaming MLlib apps

commit 604f4d738357adccc0168f8449614e8e09d9f70e
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:41:25Z

    Expanded private class to include mllib

commit c4b1143dc2ab39506aeefb2f7a89485196308d08
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:43:16Z

    Streaming linear regression

    - Abstract class to support a variety of streaming regression analyses
    - Example concrete class for streaming linear regression
    - Example usage: continually train on one data stream and test on another

commit 453974e75afbebfc605e80efaa32e8f45dc0e258
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T19:36:14Z

    Fixed indentation

commit fd31e036afe537b86d49487527eca83ac62c7630
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T19:49:32Z

    Changed logging behavior

----