> As you point out, the problem (that I see) with Mahout is that it offers
> no reason to stay within the framework; for example, there is no
> consistent way of representing datasets, unlike say numpy or pandas.
> Even something like that - load some data, have it automatically passed
> into a consistent tabular format for regression/classification - would
> be very helpful.
+1. It would indeed be great if there was some consistency in the way the
classifiers handle their inputs. It could be SequenceFiles containing
Vectors, with a separation of concerns between the code preprocessing the
inputs (from Lucene indices, text files on the filesystem, computation of
weights, etc.) and the classifiers themselves.

Julien wrote:

> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> [email protected]> wrote:
>
>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>
>>> I think I was suggesting something weaker.
>>>
>>> I was suggesting that trees get built against a portion of the data
>>> and each node builds some number of trees against just the data it
>>> sees. This is in the spirit of random forests, but not the letter.
>>
>> I'm looking at the Mahout partial implementation. It appears to me that
>> some number of trees get built against the portion of the data each
>> node sees, based on the map-reduce input split.
>> Am I wrong here? If not, Mahout already has an out-of-core
>> implementation.
>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed
> non-stationary behavior."
>
>> BTW - where did the name Partial Implementation come from? Is this
>> partially completed work :-)
>
> Ha ha ha, the name came from the fact that each mapper/node has a
> "partial" view of the dataset.
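To make the "partial" scheme concrete, here is a toy, Hadoop-free sketch of my understanding of it: each mapper grows its share of the forest from its own split alone, and the final forest is just the union of all the local trees, classifying by majority vote. All the names below are made up, and the "tree" grown here is a trivial majority-class stump; the point is the data flow, not the learner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PartialForestSketch {

    // A grown "tree" is just a classifier from feature vector to label.
    interface Tree { int classify(double[] features); }

    // Stand-in for one mapper: grow nTrees trees from this split only,
    // bootstrap-sampling within the split as the real mappers do. The
    // "tree" grown here is a trivial majority-class stump.
    static List<Tree> growPartial(List<double[]> split, List<Integer> labels,
                                  int nTrees, Random rng) {
        List<Tree> trees = new ArrayList<>();
        for (int t = 0; t < nTrees; t++) {
            int[] counts = new int[2];
            for (int i = 0; i < split.size(); i++) {
                counts[labels.get(rng.nextInt(split.size()))]++;
            }
            final int majority = counts[1] > counts[0] ? 1 : 0;
            trees.add(features -> majority);
        }
        return trees;
    }

    // The final forest is the concatenation of every mapper's trees;
    // classification is a majority vote over the whole forest.
    static int vote(List<Tree> forest, double[] features) {
        int ones = 0;
        for (Tree t : forest) ones += t.classify(features);
        return 2 * ones > forest.size() ? 1 : 0;
    }
}
```

A real version would grow proper decision trees and take its split from the map-reduce input split; concatenating every mapper's `List<Tree>` gives the forest.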
> But I do agree that it's partially completed :P I guess there was never
> enough user interest in Mahout's Decision Forest implementation to
> motivate me to keep working on it, but I do see more and more questions
> about this on the mailing lists, so maybe I will get motivated enough to
> work on it again.
>
> Some time ago I did some research on how to grow the trees on a cluster
> using all the data, when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all the data
> in memory. Using this, you need to copy the dataset to all computing
> nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
> [2]*
> Although it uses map-reduce to grow the trees, it uses lots of hacks and
> tricks to make it work.
>
> It shouldn't be too difficult to use [1] to improve the current
> implementation; that way the forest can be grown on the "full" large
> dataset on a computing cluster. One question though: do we really need
> Hadoop for this? I mean, as long as each node has access to the dataset
> files, this should be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf

>>> Another option is that all the nodes start some trees based on their
>>> own data and spit the set of all such trees to a central (possibly
>>> distributed) store. They also grab some number of trees back down from
>>> the central store and try elaborating those trees using their own
>>> data. These will also be sent back to the central store. This is
>>> closer to random forests.
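If I read Ted's central-store option right, it boils down to something like the sketch below: workers seed a shared store with trees grown from their own data, then repeatedly pull random trees back down, elaborate them against their local slice, and push them back up. Everything here is invented for illustration; the store is an in-process list (a real one would be HDFS files, HBase, or similar) and "elaborating" a tree is faked as bumping a pass counter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TreeStoreSketch {

    // A tree plus a record of how many elaboration passes it received.
    static final class Tree {
        final int owner;   // worker that started the tree
        int passes = 0;    // local-data passes applied so far
        Tree(int owner) { this.owner = owner; }
    }

    // The "central (possibly distributed) store"; in-process list here.
    private final List<Tree> store =
            Collections.synchronizedList(new ArrayList<>());

    // Phase 1: a worker seeds the store with trees from its own slice.
    void seed(int worker, int nTrees) {
        for (int i = 0; i < nTrees; i++) store.add(new Tree(worker));
    }

    // Phase 2: a worker pulls k random trees, elaborates each against
    // its local data (faked here as a counter bump) and pushes it back
    // (the trees are shared objects, so mutating them is the "push").
    void elaborate(int k, Random rng) {
        synchronized (store) {
            for (int i = 0; i < k; i++) {
                store.get(rng.nextInt(store.size())).passes++;
            }
        }
    }

    int size() { return store.size(); }

    int totalPasses() {
        synchronized (store) {
            int sum = 0;
            for (Tree t : store) sum += t.passes;
            return sum;
        }
    }
}
```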
>>> A third option is like option two, but the trees that get pulled down
>>> and spit back up collect statistics until enough nodes report such
>>> stats, at which point a new branch is added and the stats are reset.
>>> This is very close to real random forests. It should also be efficient
>>> because each node is passing many trees across its own data in each
>>> pass. If that data fits in memory (even in file cache) then the pass
>>> will be very fast. If it doesn't fit in memory, then it will run as
>>> fast as the disk will let it run. Allowing a tree to commit with only
>>> 90% coverage should allow good robustness against laggards while doing
>>> almost exactly the same thing as the real random forest algorithm.
>>>
>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]>
>>> wrote:
>>>
>>>> Ted,
>>>>
>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>> tree can be built efficiently in a distributed fashion?
>>>>
>>>> The following two ways seem like the naive ways of doing this:
>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>> data, filtering the records that don't reach its node (this is
>>>> expensive), computing the split and then writing the split out. This
>>>> requires a pass through the data for each node of the tree.
>>>> 2) as 1), except that each machine writes out the filtered data after
>>>> its node in the tree. This requires less scanning (as much as the
>>>> local case I described earlier), but lots of data movement.
>>>>
>>>> Do you have an alternative method?
>>>>
>>>> Andy
>>>>
>>>> On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> IF we have a step which permutes the data (once) then I doubt that
>>>>> redistribution is necessary.
>>>>> At that point the randomness consists of building trees based on
>>>>> different variable subsets and data subsets. The original random
>>>>> forests only split on variable subsets. How much this matters is an
>>>>> open question.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It sounds OK to me. Yes, I don't think you want to try to
>>>>>> parallelize the building of each tree just to build it off all the
>>>>>> data. It will be too slow.
>>>>>>
>>>>>> I imagine the game is to parallelize such that each worker has
>>>>>> 1/Nth of the data, where N is as low as you can manage. Each worker
>>>>>> is building one or more trees from its slice. Then iterate, and
>>>>>> split up the data differently. Each worker can now cross-validate
>>>>>> previous trees based on the new slice, or even build new ones, and
>>>>>> iterate again.
>>>>>>
>>>>>> I'm guessing this process will tend to need to build more trees
>>>>>> than usual. I think it's also possible that it will want to try to
>>>>>> discard trees with high out-of-bag error.
>>>>>>
>>>>>> Just riffing. I think this is some potential green-field territory
>>>>>> for really thinking about how you build trees when the data doesn't
>>>>>> nearly fit in memory, and what shape of computation gives a really
>>>>>> nice tradeoff between speed and accuracy. I don't think the
>>>>>> conventional approach hits that point; it's more like the above.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>>
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree
>>>>>>> has depth roughly 10-30.
>>>>>>> Rather than construct each tree in a distributed fashion,
>>>>>>> construct each tree locally by using multiple passes over the
>>>>>>> bagged sample for that tree (construct it in BFS manner, assuming
>>>>>>> the tree can fit in memory). Do this in parallel k times. I know
>>>>>>> this is perhaps not the "mapreduce" way, but I'd be interested to
>>>>>>> hear alternatives. It seems a bad idea to shuffle the data around
>>>>>>> on every BFS iteration per tree.
>>>>>>>
>>>>>>> I'm not planning to work on this right now, but would like to
>>>>>>> figure out if Mahout is a good platform to work with if I want
>>>>>>> such an algorithm.
>>>>>>>
>>>>>>> Andy
>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> [email protected] | +447799647538

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
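P.S. Sean's suggestion of discarding trees with high out-of-bag error could be as simple as the sketch below (hypothetical names; the "out-of-bag" error here is plain held-out error on a slice the tree never saw, which is exactly what the iterate-over-new-slices scheme provides):

```java
import java.util.ArrayList;
import java.util.List;

public class OobPruneSketch {

    // A grown tree is just a classifier from feature vector to label.
    interface Tree { int classify(double[] features); }

    // Fraction of the held-out slice that the tree misclassifies.
    static double heldOutError(Tree t, List<double[]> slice,
                               List<Integer> labels) {
        int wrong = 0;
        for (int i = 0; i < slice.size(); i++) {
            if (t.classify(slice.get(i)) != labels.get(i)) wrong++;
        }
        return (double) wrong / slice.size();
    }

    // Keep only the trees whose held-out error is at most maxError.
    static List<Tree> prune(List<Tree> forest, List<double[]> slice,
                            List<Integer> labels, double maxError) {
        List<Tree> kept = new ArrayList<>();
        for (Tree t : forest) {
            if (heldOutError(t, slice, labels) <= maxError) kept.add(t);
        }
        return kept;
    }
}
```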
