Hi Aman,

Likely the easiest way to parallelize decision tree building is to parallelize the search for the best split at each node, since that search checks every non-constant feature. Several other approaches focus on how to parallelize tree building in the streaming or distributed cases, which we are not interested in at the moment (though partially fitting decision trees would be a good separate project).
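To make the idea concrete, here is a rough sketch of a per-node split search that evaluates features in parallel. It is only an illustration in pure Python, not how scikit-learn's tree code is written: the names `gini`, `best_split_for_feature`, `best_split`, and the `n_jobs` argument are hypothetical, Gini impurity is assumed as the criterion, and thresholds are assumed to be midpoints between consecutive distinct values. Threads in pure Python will not give a real speedup because of the GIL; the point is only to show the structure of the work.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_for_feature(X, y, j):
    """Return (weighted_impurity, feature_index, threshold) for feature j."""
    values = sorted(set(row[j] for row in X))
    best = (float("inf"), j, None)
    if len(values) < 2:
        # Constant feature: there is nothing to split on.
        return best
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [yi for row, yi in zip(X, y) if row[j] <= t]
        right = [yi for row, yi in zip(X, y) if row[j] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[0]:
            best = (score, j, t)
    return best


def best_split(X, y, n_jobs=2):
    """Evaluate every feature in parallel and keep the lowest-impurity split."""
    n_features = len(X[0])
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(
            pool.map(lambda j: best_split_for_feature(X, y, j), range(n_features))
        )
    return min(results, key=lambda r: r[0])


# Feature 0 is constant, feature 1 separates the classes perfectly.
X = [[1.0, 0.0, 3.0], [1.0, 1.0, 1.0], [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]]
y = [0, 0, 1, 1]
score, feature, threshold = best_split(X, y)
```

Each feature's search is independent of the others, so the only shared step is the final reduction that picks the best result across features.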
As I mentioned in the GitHub issue, it is likely easier to focus on this single issue for GSoC as opposed to making it distinct from the multiclass prediction, as this will provide similar speedups either way but be more general. It'd be great if you could add your experience directly to the gist, and perhaps links to prior work if you have any.

Something major missing from this is a proposed timeline. Several projects fail because they are overly ambitious or not well managed time-wise. Showing a timeline will help us manage the project later on, and ensure that you're aware of what the steps of the project will be.

Thanks for the effort so far! Let me know when you've made updates.

Jacob

On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik <[email protected]> wrote:

> Hello Developers,
>
> This is Aman Pratik. I am currently pursuing my B.Tech from Indian
> Institute of Technology, Varanasi. After doing some research I have found
> some material on Decision Trees and Parallelization. Hence, I propose my
> first draft for the project "Parallel Decision Tree Building" for GSoC 2017.
>
> Proposal : First Draft
> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>
> Why me?
>
> I have been working in Python for the past 2 years and have good idea
> about Machine Learning algorithms. I am quite familiar with scikit-learn
> both as a user and a developer.
>
> These are the issues/PRs I have worked/working on for the past few months.
>
> [MRG+1] Issue#5803 : Regression Test added #8112
> <https://github.com/scikit-learn/scikit-learn/pull/8112>
>
> [MRG] Issue#6673:Make a wrapper around functions that score an individual
> feature #8038 <https://github.com/scikit-learn/scikit-learn/pull/8038>
>
> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in
> GaussianProcessRegressor #7997
> <https://github.com/scikit-learn/scikit-learn/pull/7997>
>
> My GitHub Profile: amanp10 <https://www.github.com/amanp10>
>
> I have worked with parallelization in one of my PR, so I am not new to it.
> I have used cython a couple of times, though as a beginner. I have not used
> Decision Tree much but I am familiar with the theory and algorithm. Also, I
> am familiar with Benchmark tests, Unit tests and other technical knowledge
> I would require for this project.
>
> Meanwhile, I have started my study for the subject and gaining experience
> with Cython. I am looking forward to guidance from the potential mentors or
> anyone willing to help.
>
> Thank You
>
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
