you should read in . 2a . This implementation is, relatively, "easy" given...
--- En date de : Sam 28.3.09, deneche abdelhakim <a_dene...@yahoo.fr> a écrit : > De: deneche abdelhakim <a_dene...@yahoo.fr> > Objet: Re: [gsoc] random forests > À: mahout-dev@lucene.apache.org > Date: Samedi 28 Mars 2009, 16h14 > > I'm actually writing my working plan, and it looks like > this: > > ***************************************************************** > 1. reference implementation of Random/Regression Forests > Building Algorithm: > . Build a forest of trees, the basic algorithm (described > in the wiki) takes a subset from the dataset as a training > set and builds a decision tree. This basic algorithm is > repeated for each tree of the forest. > . The forest is stored in a file, this way it can be used > later to classify new cases > > 2a. distributed Implementation A: > . When the dataset can be replicated to all computing > nodes. > . Each mapper has access to the whole dataset, if the > forest contains N trees and we have M mappers, each mapper > runs the basic building algorithm N/M times. > . This implementation is, relatively, given that the > reference implementation is available, because each mapper > runs the basic building algorithm "as it is". > > 2b. Distributed Implementation B: > . When the dataset is so big that it can no longer fit on > every computing node, it must be distributed over the > cluster. > . Each mapper has access to a subset from the dataset, > thus all the mappers collaborate to build each tree of the > forest. > . In this case, the basic algorithm must be rewritten to > fit in the map-reduce paradigm. > > 3. Run the Random Forest with a real dataset on EC2: > . This step is important, because running the RF on a > local dual core machine is way different from running it on > a real cluster with a real dataset. > . This can make for a good demo for Mahout > > 4. If there is still time, implement one or two other > important features of RFs such as Variable importance and > Proximity estimation > ***************************************************************** > > It is clear from the plan that I won't be able to do all > those steps, and in some way I must choose only one > implementation (2a or 2b) to do. The first implementation > should take less time to implement than 2b and I'm quite > sure I can go up to the 4th step, adding other features to > the RF. BUT the second implementation is the only one > capable of dealing with very large distributed datasets. > > My question is : when Mahout.RF will be used in a real > application, what are the odds that the dataset will be so > large that it can't fit on every machine of the cluster ? > > the answer to this question should help me decide which > implementation I'll choose. > > --- En date de : Dim 22.3.09, Ted Dunning <ted.dunn...@gmail.com> > a écrit : > > > De: Ted Dunning <ted.dunn...@gmail.com> > > Objet: Re: [gsoc] random forests > > À: mahout-dev@lucene.apache.org > > Date: Dimanche 22 Mars 2009, 0h36 > > Great expression! > > > > You may be right about the nose-bleed tendency between > the > > two methods. > > > > On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim > <a_dene...@yahoo.fr>wrote: > > > > > I can't find a no-nose-bleeding algorithm > > > > > > > > > > -- > > Ted Dunning, CTO > > DeepDyve > > > > > >