GitHub user manishamde opened a pull request: https://github.com/apache/spark/pull/2607

[MLLIB] [WIP] SPARK-1547: Adding Gradient Boosting to MLlib

Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting with AdaBoost to follow soon after this PR is accepted. This is based on work done along with @hirakendu that was pending due to decision tree optimizations and random forests work.

Ideally, boosting algorithms should work with any base learners. This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.

Here is the remaining task list:
- [ ] Stochastic gradient boosting support -- Re-use the BaggedPoint approach used for RandomForest.
- [ ] BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator. Will require minor refactoring of RandomForest code.
- [ ] Checkpointing -- This approach will avoid long lineage chains. Need to conduct experiments to verify good default settings.
- [ ] Unit Tests -- I have performed some basic tests but I need to add them as unit tests.
- [ ] Create public APIs
- [ ] Tests on multiple cluster sizes and datasets -- require help from the community on this front.

Note: Classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
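
For readers less familiar with the idea, here is a minimal sketch of gradient boosting with a "pluggable" loss, written in plain Python rather than against MLlib. Everything in it is an assumption made for illustration and not the API of this PR: the names `SquaredError`, `AbsoluteError`, `fit_stump`, and `boost` are hypothetical, a single-feature depth-1 stump stands in for the decision tree base learner, the model is initialized with the label mean for both losses, and a fixed learning rate is used in place of any per-leaf line search.

```python
from typing import Callable, List, Sequence


class SquaredError:
    """Least squares loss: L(y, F) = (y - F)^2 / 2."""

    @staticmethod
    def loss(y: float, f: float) -> float:
        return 0.5 * (y - f) ** 2

    @staticmethod
    def negative_gradient(y: float, f: float) -> float:
        return y - f


class AbsoluteError:
    """Least absolute loss: L(y, F) = |y - F|."""

    @staticmethod
    def loss(y: float, f: float) -> float:
        return abs(y - f)

    @staticmethod
    def negative_gradient(y: float, f: float) -> float:
        # sign(y - f); 0.0 when the prediction is exact
        return float((y > f) - (y < f))


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Callable[[float], float]:
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All feature values identical: fall back to a constant predictor.
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs: Sequence[float], ys: Sequence[float], loss,
          num_iterations: int = 50, learning_rate: float = 0.1) -> Callable[[float], float]:
    """Fit an additive model by repeatedly fitting stumps to negative gradients."""
    initial = sum(ys) / len(ys)  # assumption: mean initialization for every loss
    learners: List[Callable[[float], float]] = []
    preds = [initial] * len(ys)
    for _ in range(num_iterations):
        # Pseudo-residuals: negative gradient of the loss at the current predictions.
        residuals = [loss.negative_gradient(y, f) for y, f in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [f + learning_rate * stump(x) for x, f in zip(xs, preds)]

    def predict(x: float) -> float:
        return initial + learning_rate * sum(h(x) for h in learners)

    return predict


def mean_loss(predict: Callable[[float], float], xs: Sequence[float],
              ys: Sequence[float], loss) -> float:
    """Average loss of a fitted model over a dataset."""
    return sum(loss.loss(y, predict(x)) for x, y in zip(xs, ys)) / len(ys)
```

Swapping the loss only changes the pseudo-residuals that each new base learner is fit to, for example `boost(xs, ys, AbsoluteError)` instead of `boost(xs, ys, SquaredError)`; the boosting loop itself is unchanged. A classification loss such as "deviance" would need more than this sketch provides, which is consistent with the note above that classification is left for later discussion.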

cc: @jkbradley @hirakendu @mengxr @etrain @atalwalkar @chouqin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishamde/spark gbt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2607

----
commit 0ae1c0a77c9de22dd1ff50ad1e4c7b8a691aac38
Author: Manish Amde <manish...@gmail.com>
Date:   2014-09-28T03:32:22Z

    basic gradient boosting code from earlier branches

commit 55385216ff2d0a470ae783017d434d850762441f
Author: Manish Amde <manish...@gmail.com>
Date:   2014-09-28T04:32:31Z

    disable checkpointing for now

commit 6251fd56388703d9b9450980a27cf9a9a98e750d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-10-01T00:22:26Z

    modified method name

commit cdceeef09822145af2620921a94c37384d3f64c7
Author: Manish Amde <manish...@gmail.com>
Date:   2014-10-01T01:04:02Z

    added documentation

----