GitHub user manishamde opened a pull request:

    https://github.com/apache/spark/pull/2607

    [MLLIB] [WIP] SPARK-1547: Adding Gradient Boosting to MLlib

    Given the demand for gradient boosting and AdaBoost in MLlib, I am creating 
a WIP branch for early feedback on gradient boosting. AdaBoost will follow soon 
after this PR is accepted. This is based on work done with @hirakendu that was 
on hold pending the decision tree optimizations and random forests work.
    
    Ideally, boosting algorithms should work with any base learner. This will 
become possible once the MLlib API is finalized -- we want to ensure we use a 
consistent interface for the underlying base learners. In the meantime, this PR 
uses decision trees as base learners for the gradient boosting algorithm. The 
current PR allows "pluggable" loss functions and provides least squares error 
and least absolute error by default.
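
    To make the "pluggable" loss idea concrete, here is a minimal Python sketch. It is not the code in this PR (which is Scala and uses MLlib decision trees). The `Loss`, `fit_stump`, `boost`, and `BoostedModel` names are hypothetical, the base learner is a single-feature depth-1 regression stump, and the least absolute error variant fits leaf means rather than leaf medians to keep things short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Loss:
    """Hypothetical pluggable loss: boosting only needs its negative gradient."""
    name: str
    negative_gradient: Callable[[float, float], float]  # (label, prediction)


def _sign(x: float) -> float:
    return float((x > 0) - (x < 0))


# Least squares: L = (y - f)^2 / 2, so -dL/df = y - f.
LEAST_SQUARES = Loss("least_squares", lambda y, f: y - f)
# Least absolute error: L = |y - f|, so -dL/df = sign(y - f).
LEAST_ABSOLUTE = Loss("least_absolute", lambda y, f: _sign(y - f))

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Stump:
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best: Optional[Tuple[float, Stump]] = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, (threshold, lmean, rmean))
    if best is None:  # every x is identical: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1]


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left, right = stump
    return left if x <= threshold else right


@dataclass
class BoostedModel:
    init: float
    learning_rate: float
    stumps: List[Stump]

    def predict(self, x: float) -> float:
        return self.init + self.learning_rate * sum(predict_stump(s, x) for s in self.stumps)


def boost(xs: Sequence[float], ys: Sequence[float], loss: Loss,
          num_iterations: int = 50, learning_rate: float = 0.1) -> BoostedModel:
    """Gradient boosting: each round fits a stump to the loss's pseudo-residuals."""
    init = sum(ys) / len(ys)
    preds = [init] * len(ys)
    stumps: List[Stump] = []
    for _ in range(num_iterations):
        residuals = [loss.negative_gradient(y, f) for y, f in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [f + learning_rate * predict_stump(stump, x) for x, f in zip(xs, preds)]
    return BoostedModel(init, learning_rate, stumps)
```

    Swapping the loss changes only the pseudo-residuals the base learner is fit to, e.g. `boost(xs, ys, LEAST_ABSOLUTE)` versus `boost(xs, ys, LEAST_SQUARES)`. The boosting loop itself stays the same.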
    
    Here is the remaining task list:
    - [ ] Stochastic gradient boosting support -- Re-use the BaggedPoint 
approach used for RandomForest.
    - [ ] BaggedRDD caching -- Avoid repeating the feature-to-bin mapping for 
each tree estimator. Will require minor refactoring of the RandomForest code.
    - [ ] Checkpointing -- This will avoid long lineage chains. Experiments are 
needed to determine good default settings.
    - [ ] Unit tests -- I have performed some basic tests, but they still need 
to be added as unit tests.
    - [ ] Create public APIs
    - [ ] Tests on multiple cluster sizes and datasets -- will need help from 
the community on this front.
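
    As a rough illustration of the stochastic gradient boosting item above, the Python sketch below precomputes one 0/1 subsample weight per data point per tree, which is the general idea behind a bagged-point representation. The function names, the Bernoulli (without replacement) sampling, and the plain-list data layout are assumptions made for illustration and are not taken from the RandomForest code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bagged_weights(num_points: int, num_trees: int,
                   subsampling_rate: float, seed: int) -> List[List[int]]:
    """Return weights[i][t] = 1 if point i is included in the sample for tree t.

    Each point is kept independently with probability `subsampling_rate`
    (sampling without replacement). The seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < subsampling_rate else 0 for _ in range(num_trees)]
            for _ in range(num_points)]


def subsample_for_tree(points: Sequence[T], weights: Sequence[Sequence[int]],
                       tree_index: int) -> List[T]:
    """Select the points that take part in fitting the given tree."""
    return [p for p, w in zip(points, weights) if w[tree_index] == 1]
```

    Because the weights are computed once up front, any per-point preprocessing (such as the feature-to-bin mapping mentioned in the caching item) can be shared across all trees instead of being repeated for every subsample.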
    
    Note: Classification is currently not supported by this PR since it 
requires discussion on the best way to support "deviance" as a loss function.
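
    For context on the open question, one common formulation of "deviance" for binary classification is the binomial deviance from the gradient boosting literature, with labels in {-1, +1}: L(y, F) = log(1 + exp(-2yF)). The Python sketch below shows that loss and its negative gradient. Whether MLlib would adopt this form, label encoding, or scaling is exactly what remains to be discussed, so treat it as an assumption and not as the planned API.

```python
import math


def deviance(y: int, f: float) -> float:
    """Binomial deviance log(1 + exp(-2*y*f)) for a label y in {-1, +1}."""
    z = 2.0 * y * f
    # Pick the algebraically equivalent form that avoids overflow in exp().
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def negative_gradient(y: int, f: float) -> float:
    """Pseudo-residual -dL/df = 2*y / (1 + exp(2*y*f))."""
    z = 2.0 * y * f
    if z >= 0:
        e = math.exp(-z)
        return 2.0 * y * e / (1.0 + e)
    return 2.0 * y / (1.0 + math.exp(z))
```

    Unlike the regression losses, the raw prediction F here is a score that still has to be mapped to a class or probability, which is part of why classification support needs separate design discussion.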
    
    cc: @jkbradley @hirakendu @mengxr @etrain @atalwalkar @chouqin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishamde/spark gbt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2607
    
----
commit 0ae1c0a77c9de22dd1ff50ad1e4c7b8a691aac38
Author: Manish Amde <manish...@gmail.com>
Date:   2014-09-28T03:32:22Z

    basic gradient boosting code from earlier branches

commit 55385216ff2d0a470ae783017d434d850762441f
Author: Manish Amde <manish...@gmail.com>
Date:   2014-09-28T04:32:31Z

    disable checkpointing for now

commit 6251fd56388703d9b9450980a27cf9a9a98e750d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-10-01T00:22:26Z

    modified method name

commit cdceeef09822145af2620921a94c37384d3f64c7
Author: Manish Amde <manish...@gmail.com>
Date:   2014-10-01T01:04:02Z

    added documentation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

