Thank you for your answer; it made me aware of several potential future problems 
hidden in my implementation.

> The first is that for any given application, the odds that
> the data will not fit in a single machine are small, especially if you 
> have an out-of-core tree builder.  Really, really big datasets are
> increasingly common, but are still a small minority of all datasets.

By "out-of-core", do you mean that the builder can fetch the data directly from a 
file instead of working only from data held in memory?

> One question I have about your plan is whether your step (1) involves
> building trees or forests only from data held in memory or whether it 
> can be adapted to stream through the data (possibly several
> times).  If a streaming implementation is viable, then it may well be 
> that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's 
DistributedCache. I think a streaming implementation is feasible: the basic 
tree-building algorithm (described here 
http://cwiki.apache.org/MAHOUT/random-forests.html) has to stream through the 
data (either in memory or from a file) once for each node of the tree, and during 
this pass it computes the information gain (IG) of the selected variables. 
The algorithm could be improved to compute the IGs for a whole list of nodes in a 
single pass, thus reducing the total number of passes through the data. When 
building the forest, that list of nodes would come from all the trees being built 
by the mapper.
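
For illustration, here is a minimal, self-contained Java sketch of the counting 
step behind this idea. The class names and toy data are my own placeholders, not 
Mahout's API: one pass over the rows accumulates per-value class counts for a 
node's selected variable, and the information gain is then computed from those 
counts.

    // Hypothetical sketch: per-value class counts for one (node, variable) pair,
    // accumulated in a single streaming pass, then scored by information gain.
    import java.util.HashMap;
    import java.util.Map;

    public class BatchIgSketch {

      static class Counts {
        final Map<String, int[]> perValue = new HashMap<>();
        final int[] total;
        final int numClasses;

        Counts(int numClasses) {
          this.numClasses = numClasses;
          this.total = new int[numClasses];
        }

        // Called once per streamed row that reaches this node.
        void add(String attrValue, int label) {
          perValue.computeIfAbsent(attrValue, v -> new int[numClasses])[label]++;
          total[label]++;
        }

        // IG = H(parent) - sum over values v of (n_v / n) * H(child_v)
        double informationGain() {
          int n = sum(total);
          double ig = entropy(total);
          for (int[] child : perValue.values()) {
            ig -= (double) sum(child) / n * entropy(child);
          }
          return ig;
        }

        static int sum(int[] c) { int s = 0; for (int x : c) s += x; return s; }

        static double entropy(int[] c) {
          int n = sum(c);
          double h = 0.0;
          for (int x : c) {
            if (x > 0) {
              double p = (double) x / n;
              h -= p * (Math.log(p) / Math.log(2));
            }
          }
          return h;
        }
      }

      public static void main(String[] args) {
        // Toy rows: {variable value, class label}. In the real builder the rows
        // would be streamed from a file, and one Counts object would be kept per
        // (open node, selected variable) pair and filled during the same pass.
        String[][] rows = {
          {"sunny", "0"}, {"sunny", "0"}, {"overcast", "1"},
          {"rain", "1"}, {"rain", "1"}, {"rain", "0"}
        };
        Counts counts = new Counts(2);
        for (String[] row : rows) {
          counts.add(row[0], Integer.parseInt(row[1]));
        }
        System.out.println("information gain = " + counts.informationGain());
      }
    }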

> Another way to put this is that the key question is how single node
> computation scales with input size.  If the scaling is relatively linear
> with data size, then your approach (3) will work no matter the data size.
> If scaling shows an evil memory size effect, then your approach (2) 
> would be required for large data sets.

I'll have to run some tests before answering this question, but I think the memory 
usage of the improved algorithm (described above) will mainly go toward storing 
the IG computations (variable probabilities, etc.). One way to limit memory usage 
is to cap the number of tree nodes processed in each pass over the data. Raising 
this cap should reduce the number of data passes but increase the memory usage, 
and vice versa.
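
To make the trade-off concrete, here is a tiny back-of-the-envelope sketch (all 
numbers and names are made up for illustration): memory per pass is roughly 
proportional to the number of nodes scored together, while the number of passes 
shrinks as that cap grows.

    // Hypothetical numbers only: relates the nodes-per-pass cap to the number of
    // data passes and the number of IG counters that must be held in memory.
    public class PassBudget {
      public static void main(String[] args) {
        int totalOpenNodes = 10_000;   // nodes awaiting a split across all trees of the mapper
        int maxNodesPerPass = 500;     // tunable cap on nodes scored in one data pass
        int selectedVarsPerNode = 10;  // variables sampled per node
        int valuesPerVar = 20;         // distinct values (or bins) per variable
        int numClasses = 2;

        long countersPerPass =
            (long) maxNodesPerPass * selectedVarsPerNode * valuesPerVar * numClasses;
        int passes = (totalOpenNodes + maxNodesPerPass - 1) / maxNodesPerPass;

        System.out.println("passes over the data: " + passes);
        System.out.println("IG counters held in memory per pass: " + countersPerPass);
      }
    }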

There is still one case that this approach, even out-of-core, cannot handle: very 
large datasets that do not fit on a single node's hard drive and thus must be 
partitioned across the cluster.

abdelHakim
--- On Mon, 3/30/09, Ted Dunning <ted.dunn...@gmail.com> wrote:

> From: Ted Dunning <ted.dunn...@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Monday, March 30, 2009, 12:59 AM
> I have two answers for you.
> 
> The first is that for any given application, the odds that the data will not
> fit in a single machine are small, especially if you have an out-of-core tree
> builder.  Really, really big datasets are increasingly common, but are still
> a small minority of all datasets.
> 
> The second answer is that the odds that SOME mahout application will be too
> large for a single node are quite high.
> 
> These aren't contradictory.  They just describe the long-tail nature of
> problem sizes.
> 
> One question I have about your plan is whether your step (1) involves
> building trees or forests only from data held in memory or whether it can be
> adapted to stream through the data (possibly several times).  If a streaming
> implementation is viable, then it may well be that performance is still
> quite good for small datasets due to buffering.
> 
> If streaming works, then a single node will be able to handle very large
> datasets but will just be kind of slow.  As you point out, that can be
> remedied trivially.
> 
> Another way to put this is that the key question is how single node
> computation scales with input size.  If the scaling is relatively linear
> with data size, then your approach (3) will work no matter the data size.
> If scaling shows an evil memory size effect, then your approach (2) would be
> required for large data sets.
> 
> On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:
> 
> > My question is: when Mahout.RF will be used in a real application, what
> > are the odds that the dataset will be so large that it can't fit on every
> > machine of the cluster?
> >
> > The answer to this question should help me decide which implementation
> > I'll choose.
> >
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 408-773-0110 ext. 738
> 858-414-0013 (m)
> 408-773-0220 (fax)
> 


