GitHub user freeman-lab opened a pull request:

    https://github.com/apache/spark/pull/1361

    Streaming mllib

    This PR implements streaming linear regression, in which a linear 
regression model is trained online as new data arrive. The design is based on 
discussions with @tdas and @mengxr, in which we worked out how to add this 
functionality in a general way, with minimal changes to existing libraries.
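
As a rough illustration of what "trained online" means here, the sketch below runs a few gradient-descent steps on each mini-batch and carries the weights over to the next batch. It is plain Python, not the Scala code in this PR; the names `sgd_update`, `step_size`, and `num_iterations`, and the toy data, are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# One mini-batch: (label, features) pairs. Hypothetical stand-in for one
# interval of a stream; not a Spark type.
Batch = Sequence[Tuple[float, Sequence[float]]]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_update(
    weights: Sequence[float],
    batch: Batch,
    step_size: float = 0.1,
    num_iterations: int = 10,
) -> List[float]:
    """Return new weights after a few gradient steps on one mini-batch.

    Starts from the weights learned so far, so each batch refines the
    model rather than refitting it from scratch.
    """
    w = list(weights)
    if not batch:
        return w  # nothing arrived in this interval
    for _ in range(num_iterations):
        grad = [0.0] * len(w)
        for label, features in batch:
            error = predict(w, features) - label
            for j, x in enumerate(features):
                grad[j] += error * x
        # Average the squared-error gradient over the batch, then step.
        w = [wj - step_size * gj / len(batch) for wj, gj in zip(w, grad)]
    return w


# Toy stream generated from y = 2*x1 - 1*x2 (assumed data, no noise).
stream = [
    [(2.0, [1.0, 0.0]), (-1.0, [0.0, 1.0])],
    [(1.0, [1.0, 1.0]), (3.0, [2.0, 1.0])],
    [(0.0, [1.0, 2.0]), (4.0, [2.0, 0.0])],
]

weights = [0.0, 0.0]
for batch in stream:
    weights = sgd_update(weights, batch)  # model is updated as data arrive
```

After each batch the weights should move closer to the generating coefficients (2, -1); more batches tighten the estimate.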
    
    __Summary of additions:__
    
    _StreamingRegression_
    - An abstract class for fitting regression models online to streaming 
data, including training (and updating) a model and making predictions
    
    _StreamingLinearRegressionWithSGD_
    - Class and companion object for running streaming linear regression
    
    _MLStreamingUtils_
    - Utility for loading and parsing streaming data from a text file stream; 
it could be extended with functions for loading data from Kafka, network, etc.
    
    _StreamingLinearRegression_
    - Example use case: fitting a model online to data from one stream, and 
making predictions on other data
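
The split between an abstract streaming class and a concrete SGD-based subclass can be sketched roughly as follows. This is a Python analogue written for illustration, not the Scala classes in this PR: the method names `train_on`, `predict_on`, and `update`, the constructor parameters, and the use of a list of batches in place of a stream are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Sequence, Tuple

LabeledPoint = Tuple[float, Sequence[float]]  # (label, features)


def dot(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


class StreamingRegression(ABC):
    """Holds a model and updates it batch by batch (hypothetical analogue)."""

    def __init__(self, initial_weights: Sequence[float]) -> None:
        self.weights: List[float] = list(initial_weights)

    @abstractmethod
    def update(self, batch: Sequence[LabeledPoint]) -> None:
        """Refit on one mini-batch, starting from the current weights."""

    def train_on(self, batches: Iterable[Sequence[LabeledPoint]]) -> None:
        # Each incoming batch refines the same model; empty batches are skipped.
        for batch in batches:
            if batch:
                self.update(batch)

    def predict_on(
        self, batches: Iterable[Sequence[Sequence[float]]]
    ) -> List[List[float]]:
        # Predict with whatever the model currently is.
        return [[dot(self.weights, f) for f in batch] for batch in batches]


class StreamingLinearRegressionWithSGD(StreamingRegression):
    """Concrete subclass: least-squares gradient steps on each batch."""

    def __init__(
        self,
        initial_weights: Sequence[float],
        step_size: float = 0.1,
        num_iterations: int = 10,
    ) -> None:
        super().__init__(initial_weights)
        self.step_size = step_size
        self.num_iterations = num_iterations

    def update(self, batch: Sequence[LabeledPoint]) -> None:
        for _ in range(self.num_iterations):
            grad = [0.0] * len(self.weights)
            for label, features in batch:
                error = dot(self.weights, features) - label
                for j, x in enumerate(features):
                    grad[j] += error * x
            self.weights = [
                w - self.step_size * g / len(batch)
                for w, g in zip(self.weights, grad)
            ]


# Assumed toy data from y = 2*x1 - 1*x2; the three batches are repeated to
# stand in for a longer training stream.
train_stream = [
    [(2.0, [1.0, 0.0]), (-1.0, [0.0, 1.0])],
    [(1.0, [1.0, 1.0]), (3.0, [2.0, 1.0])],
    [(0.0, [1.0, 2.0]), (4.0, [2.0, 0.0])],
] * 5
test_stream = [[[1.0, 1.0], [2.0, 0.0]]]  # unlabeled features only

model = StreamingLinearRegressionWithSGD(initial_weights=[0.0, 0.0])
model.train_on(train_stream)
predictions = model.predict_on(test_stream)
```

With this layout, another regression variant would only need its own `update` method, which is the kind of reuse the notes below have in mind for Ridge and Lasso.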
    
    __Notes__
    - I will definitely add tests, but I'm not sure where they belong: 
mllib or streaming?
    - If this looks good, I can use the StreamingRegression class to do all 
other regression analyses (Ridge, Lasso, etc.), and a similar 
StreamingClassification class would give us logistic and SVM classification.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/freeman-lab/spark streaming-mllib

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1361
    
----
commit 0898add2e1dd2f1faac9e8d08c758994af03ee6e
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:39:31Z

    Added dependency on streaming

commit d99aa85d8f275ca605aacb2804f0c55fff10ff2b
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:40:55Z

    Helper methods for streaming MLlib apps

commit 604f4d738357adccc0168f8449614e8e09d9f70e
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:41:25Z

    Expanded private class to include mllib

commit c4b1143dc2ab39506aeefb2f7a89485196308d08
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T14:43:16Z

    Streaming linear regression
    
    - Abstract class to support a variety of streaming regression analyses
    - Example concrete class for streaming linear regression
    - Example usage: continually train on one data stream and test on
    another

commit 453974e75afbebfc605e80efaa32e8f45dc0e258
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T19:36:14Z

    Fixed indentation

commit fd31e036afe537b86d49487527eca83ac62c7630
Author: freeman <the.freeman....@gmail.com>
Date:   2014-07-10T19:49:32Z

    Changed logging behavior

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
