Re: [gsoc] random forests

deneche abdelhakim Sat, 28 Mar 2009 12:11:29 -0700

you should read in . 2a

. This implementation is, relatively, "easy" given...


--- En date de : Sam 28.3.09, deneche abdelhakim <a_dene...@yahoo.fr> a écrit :

> De: deneche abdelhakim <a_dene...@yahoo.fr>
> Objet: Re: [gsoc] random forests
> À: mahout-dev@lucene.apache.org
> Date: Samedi 28 Mars 2009, 16h14
> 
> I'm actually writing my working plan, and it looks like
> this:
> 
> *****************************************************************
> 1. reference implementation of Random/Regression Forests
> Building Algorithm: 
>  . Build a forest of trees, the basic algorithm (described
> in the wiki) takes a subset from the dataset as a training
> set and builds a decision tree. This basic algorithm is
> repeated for each tree of the forest. 
>  . The forest is stored in a file, this way it can be used
> later to classify new cases
> 
> 2a. distributed Implementation A: 
>  . When the dataset can be replicated to all computing
> nodes.
>  . Each mapper has access to the whole dataset, if the
> forest contains N trees and we have M mappers, each mapper
> runs the basic building algorithm N/M times.
>  . This implementation is, relatively, given that the
> reference implementation is available, because each mapper
> runs the basic building algorithm "as it is".
> 
> 2b. Distributed Implementation B:
>  . When the dataset is so big that it can no longer fit on
> every computing node, it must be distributed over the
> cluster. 
>  . Each mapper has access to a subset from the dataset,
> thus all the mappers collaborate to build each tree of the
> forest.
>  . In this case, the basic algorithm must be rewritten to
> fit in the map-reduce paradigm.
> 
> 3. Run the Random Forest with a real dataset on EC2:
>  . This step is important, because running the RF on a
> local dual core machine is way different from running it on
> a real cluster with a real dataset.
>  . This can make for a good demo for Mahout
> 
> 4. If there is still time, implement one or two other
> important features of RFs such as Variable importance and
> Proximity estimation
> *****************************************************************
> 
> It is clear from the plan that I won't be able to do all
> those steps, and in some way I must choose only one
> implementation (2a or 2b) to do. The first implementation
> should take less time to implement than 2b and I'm quite
> sure I can go up to the 4th step, adding other features to
> the RF. BUT the second implementation is the only one
> capable of dealing with very large distributed datasets.
> 
> My question is : when Mahout.RF will be used in a real
> application, what are the odds that the dataset will be so
> large that it can't fit on every machine of the cluster ? 
> 
> the answer to this question should help me decide which
> implementation I'll choose.
> 
> --- En date de : Dim 22.3.09, Ted Dunning <ted.dunn...@gmail.com>
> a écrit :
> 
> > De: Ted Dunning <ted.dunn...@gmail.com>
> > Objet: Re: [gsoc] random forests
> > À: mahout-dev@lucene.apache.org
> > Date: Dimanche 22 Mars 2009, 0h36
> > Great expression!
> > 
> > You may be right about the nose-bleed tendency between
> the
> > two methods.
> > 
> > On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim
> <a_dene...@yahoo.fr>wrote:
> > 
> > > I can't find a no-nose-bleeding algorithm
> > 
> > 
> > 
> > 
> > -- 
> > Ted Dunning, CTO
> > DeepDyve
> > 
> 
> 
> 
>

Re: [gsoc] random forests

Reply via email to