I think that it helps to view random forests in terms of the general
ensemble approach.

The basic idea is that you want to train up many weak classifiers.
Classical random forests select a subset of variables to work with.  You
could also work on a subset of the data as well.  This is what the
partial implementation does.
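
The two knobs can be sketched in a few lines of Python (a toy
illustration, not Mahout's code: the decision stumps and the
accuracy-based split criterion are stand-ins I'm inventing for brevity):

```python
import random

def train_stump(rows, features):
    """Toy weak learner: pick the feature whose mean threshold best
    separates the two classes, judged by raw accuracy on the rows."""
    best = None
    for f in features:
        mean = sum(r[f] for r in rows) / len(rows)
        acc = sum((r[f] > mean) == r["label"] for r in rows) / len(rows)
        if best is None or acc > best[2]:
            best = (f, mean, acc)
    f, thresh, _ = best
    return lambda r: r[f] > thresh

def train_forest(data, n_trees, n_features, sample_frac, seed=42):
    """Each weak learner sees a random subset of variables (classical
    random forest) and a random subset of the rows (the partial idea)."""
    rng = random.Random(seed)
    all_features = [k for k in data[0] if k != "label"]
    forest = []
    for _ in range(n_trees):
        feats = rng.sample(all_features, n_features)           # variable subset
        rows = rng.sample(data, int(len(data) * sample_frac))  # data subset
        forest.append(train_stump(rows, feats))
    return forest

def predict(forest, row):
    """Majority vote over the weak learners."""
    votes = sum(tree(row) for tree in forest)
    return votes * 2 > len(forest)
```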

You can also have each node collect the data for many trees from part of
the data and send their summary statistics to other nodes so that all of
the data for a particular tree winds up in the same place.  These other
nodes can elaborate all of the trees (possibly in parallel) by adding a new
decision level and send the trees back to all of the original nodes.  These
can pass over their data collating partial statistics for the next round in
the same way as the original round.  This approach admits many different
divisions of the problem.  Each of the original nodes can look at all or
part of the data.  If part, it can be a randomized part of the data or a
map-reduce style slice.  Each of the original nodes can learn for all of
the trees, or a subset.  If a subset, you can have multiple nodes for each
data slice or just one.  The collating nodes can involve many nodes or a
few.  The collating nodes can be a separate set of nodes, or they can
just be original nodes that moonlight as collators.

My guess is that having all original nodes handle all trees is a good move
since the number of trees is likely to be relatively small and that makes
passing over all of the data simple.
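
A minimal single-process sketch of one such round (hypothetical names; a
real implementation would key the statistics by tree and frontier leaf as
well, and would use an impurity measure rather than the raw
misclassification count used here):

```python
from collections import Counter

def partial_stats(rows, n_buckets=4):
    """One data node's pass over its slice: class counts keyed by
    (feature, bucket, label), cheap to compute and cheap to ship."""
    stats = Counter()
    for r in rows:
        for f, v in r.items():
            if f == "label":
                continue
            bucket = min(int(v * n_buckets), n_buckets - 1)
            stats[(f, bucket, r["label"])] += 1
    return stats

def collate(partials):
    """Collating node: sum the partial statistics from every data node."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

def best_split(stats, n_buckets=4):
    """Pick the (feature, boundary) that misclassifies the fewest rows
    when everything right of the boundary is predicted positive."""
    features = {f for (f, _, _) in stats}
    best = None
    for f in features:
        for b in range(1, n_buckets):
            wrong = sum(c for (g, bk, lab), c in stats.items()
                        if g == f and ((bk < b) == lab))
            if best is None or wrong < best[0]:
                best = (wrong, f, b)
    return best[1], best[2]
```

Collating the partials from two slices gives exactly the statistics of a
single pass over all of the data, which is the point of the scheme.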

This style of coding is well suited to a BSP framework such as Giraph,
but it isn't hard to hand-code either.  It is a natural fit for something
like Mesos or YARN as well.
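
Hand-coded, the BSP structure is just a scan/collate/grow loop with the
barrier at the collation step (an illustrative sketch with made-up names,
no framework):

```python
from collections import Counter

def superstep(slices, trees, local_stats, grow):
    """One BSP round: every worker scans its slice, the partials are
    collated at the barrier, and the grown trees are broadcast back."""
    partials = [local_stats(s, trees) for s in slices]
    totals = Counter()
    for p in partials:
        totals.update(p)
    return grow(trees, totals)

def run(slices, trees, local_stats, grow, n_rounds):
    """Repeat the round once per decision level being added."""
    for _ in range(n_rounds):
        trees = superstep(slices, trees, local_stats, grow)
    return trees
```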

On Fri, Feb 15, 2013 at 6:29 PM, Marty Kube <[email protected]> wrote:

> I just took a look at the Rain Forest paper.  It's from '98 and the feel
> is really different.  Large main memory is measured in MB and you use like
> only one computer to do stuff.  Weird :-)
>
> I think the gist of the algorithm is to build one tree per host and to
> have roughly one pass over the data set per node in the tree.  Is that
> scalable?  If the data is large I tend to think you would want few passes
> through the data and not repeated passes over the entire dataset per host.
>
> It makes me think about sampling the data set until it fits in memory or
> using the partial implementation...
>
> On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
>
>> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube
>> <[email protected]> wrote:
>>
>>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>>
>>>> I think I was suggesting something weaker.
>>>>
>>>> I was suggesting that trees get built against a portion of the data
>>>> and each node builds some number of trees against just the data it
>>>> sees.  This is in the spirit of random forests, but not the letter.
>>>>
>>> I'm looking at the Mahout partial implementation.  It appears to me
>>> that some number of trees get built against the portion of the data
>>> each node sees, based on the map-reduce input split.
>>> Am I wrong here?  If not, Mahout already has an out-of-core
>>> implementation.
>>>
>> Indeed, each mapper will grow a subset of the forest using only the
>> data passed to it. Actually, this was based on a suggestion by Ted:
>>
>> "modify the original algorithm to build multiple trees for different
>> portions of the data. That loses some of the solidity of the original
>> method, but could actually do better if the splits exposed non-stationary
>> behavior."
>>
>>
>>> BTW - where did the name Partial Implementation come from?  Is this
>>> partially completed work :-)
>>>
>>
>> Ha ha ha, the name came from the fact that each mapper/node has a
>> "partial" view of the dataset. But I do agree, it's partially completed
>> :P I guess there was never enough user interest in Mahout's Decision
>> Forest implementation to motivate me to keep working on it, but I do
>> see more and more questions on the mailing lists about this; maybe I
>> will get motivated enough to work on this again.
>>
>> I did some research, some time ago, about how to grow the trees on a
>> cluster using all data when the dataset is too big to fit in memory. Two
>> papers got my attention:
>>
>> *RainForest - A Framework for Fast Decision Tree Construction of Large
>> Datasets [1]*
>> This paper describes how to grow the trees without loading all the data
>> in memory. Using this, you need to copy the dataset to all computing
>> nodes.
>>
>> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
>> Although it uses map-reduce to grow the trees, it uses lots of hacks
>> and tricks to make it work.
>>
>> It shouldn't be too difficult to use [1] to improve the current
>> implementation; this way the forest can be grown on the "full" large
>> dataset on a computing cluster. One question though: do we really need
>> Hadoop for this? As long as each node has access to the dataset files,
>> that should be enough to build the forest on any number of nodes.
>>
>> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
>> [2] http://dejanseo.com.au/research/google/36296.pdf
>>
>>
>>>> Another option is that all the nodes start some trees based on their
>>>> own data and ship the set of all such trees to a central (possibly
>>>> distributed) store.  They also grab some number of trees back down
>>>> from the central store and try elaborating those trees using their
>>>> own data.  These will also be sent back to the central store.  This
>>>> is closer to random forests.
>>>>
>>>> A third option is like option two, but the trees that get pulled down
>>>> and shipped back up collect statistics until enough nodes report such
>>>> stats, at which point a new branch is added and the stats are reset.
>>>> This is very close to real random forests.  It should also be
>>>> efficient because each node is passing many trees across its own data
>>>> in each pass.  If that data fits in memory (even in file cache) then
>>>> the pass will be very fast.  If it doesn't fit in memory, then it
>>>> will run as fast as the disk will let it run.  Allowing a tree to
>>>> commit with only 90% coverage should allow good robustness against
>>>> laggards while doing almost exactly the same thing as the real random
>>>> forest algorithm.
>>>>
>>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]>
>>>> wrote:
>>>>
>>>>> Ted,
>>>>>
>>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>>> tree can be built efficiently in a distributed fashion?
>>>>>
>>>>> The following 2 ways seem like the naive ways of doing this:
>>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>>> data, filtering those that don't get to its node (this is expensive),
>>>>> computing the split and then writing the split out. This requires a
>>>>> pass through the data for each node of the tree.
>>>>> 2) as 1), except that each machine writes out the filtered data after
>>>>> its node in the tree. This requires less scanning (as much as the
>>>>> local case I described earlier), but lots of data movement.
>>>>>
>>>>> Do you have an alternative method?
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> If we have a step which permutes data (once) then I doubt that
>>>>>> redistribution is necessary.  At that point the randomness consists
>>>>>> of building trees based on different variable subsets and data
>>>>>> subsets.  The original random forests only split on variable
>>>>>> subsets.  How much this matters is an open question.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]> wrote:
>>>>>>
>>>>>>> It sounds OK to me. Yes, I don't think you want to try to
>>>>>>> parallelize the building of each tree just to build it off all the
>>>>>>> data. It will be too slow.
>>>>>>>
>>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>>> again.
>>>>>>>
>>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>>> trees with high out-of-bag error.
>>>>>>>
>>>>>>> Just riffing. I think this is some potential green-field territory
>>>>>>> to really think about how you build trees when the data doesn't
>>>>>>> nearly fit in memory, and what shape of computation gives a really
>>>>>>> nice tradeoff between speed and accuracy. I don't think the
>>>>>>> conventional approach is that point, but it's more like the above.
>>>>>>>
>>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>>>
>>>>>>>> We construct k decision trees (where k might be 20-100). Each tree
>>>>>>>> has depth roughly 10-30. Rather than construct each tree in a
>>>>>>>> distributed fashion, construct each tree locally by using multiple
>>>>>>>> passes over the bagged sample for that tree (construct in BFS
>>>>>>>> manner, assuming the tree can fit in memory). Do this in parallel
>>>>>>>> k times. I know this is perhaps not the "mapreduce" way, but I'd
>>>>>>>> be interested to hear alternatives. It seems a bad idea to shuffle
>>>>>>>> the data around on every BFS iteration per tree.
>>>>>>>>
>>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>>> algorithm.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>  --
>>>>> Dr Andy Twigg
>>>>> Junior Research Fellow, St Johns College, Oxford
>>>>> Room 351, Department of Computer Science
>>>>> http://www.cs.ox.ac.uk/people/****andy.twigg/<http://www.cs.ox.ac.uk/people/**andy.twigg/>
>>>>> <http://www.cs.**ox.ac.uk/people/andy.twigg/<http://www.cs.ox.ac.uk/people/andy.twigg/>
>>>>> >
>>>>> [email protected] | +447799647538
