Memory mapping would help, but a proper scalable random forest implementation should be the goal imo, especially given the claims of Mahout.
I am having a hard time convincing myself that doing it on Hadoop is the
best idea (and like I said, it's not like there are other libraries that
would make life much easier).

Andy

On 15 February 2013 13:47, Marty Kube <[email protected]> wrote:
> The hurdle I faced on my current project is that we had many random
> forests and the RAM requirements per JVM during classification were too
> big, so we could not use Mahout.
>
> We went to a memory-mapped forest representation, which worked out
> nicely. Is that a feature Mahout could use?
>
> On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
>> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <[email protected]> wrote:
>>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>>> I think I was suggesting something weaker.
>>>>
>>>> I was suggesting that trees get built against a portion of the data
>>>> and each node builds some number of trees against just the data it
>>>> sees. This is in the spirit of random forests, but not the letter.
>>>
>>> I'm looking at the Mahout partial implementation. It appears to me
>>> that some number of trees get built against the portion of the data
>>> each node sees, based on the map-reduce input split.
>>> Am I wrong here? If not, Mahout already has an out-of-core
>>> implementation.
>>
>> Indeed, each mapper will grow a subset of the forest using only the
>> data passed to it. Actually, this was based on a suggestion by Ted:
>>
>> "modify the original algorithm to build multiple trees for different
>> portions of the data. That loses some of the solidity of the original
>> method, but could actually do better if the splits exposed
>> non-stationary behavior."
>>
>>> BTW - where did the name Partial Implementation come from? Is this
>>> partially completed work :-)
>>
>> Ha ha ha, the name came from the fact that each mapper/node has a
>> "partial" view of the dataset.
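[Editor's note: the "partial" scheme described above, where each mapper grows a few trees using only its own input split and the final forest is just the union, can be sketched as follows. The stump learner and all names here are illustrative toys, not Mahout's actual API.]

```python
# Toy sketch of the "partial" scheme: one "mapper" per data split, each
# growing its trees on its split only; the forest is the union of all
# trees and classification is a majority vote.
from collections import Counter

def train_stump(rows):
    """Grow a depth-1 'tree': pick the (feature, threshold) pair whose
    per-side majority classes explain the most rows of this split."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            if not left or not right:
                continue
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    _, f, t, l, r = best
    return lambda x: l if x[f] <= t else r

def grow_partial_forest(splits, trees_per_split=3):
    """Each 'mapper' sees only its own split, as in the partial implementation."""
    return [train_stump(split) for split in splits for _ in range(trees_per_split)]

def classify(forest, x):
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```

Each stump here is grown deterministically to keep the sketch short; a real implementation would also bag rows and sample features per tree.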
>> But I do agree that it's partially completed :P I guess there was
>> never enough user interest in Mahout's Decision Forest implementation
>> to motivate me to keep working on it, but I do see more and more
>> questions on the mailing lists about this, so maybe I will get
>> motivated enough to work on it again.
>>
>> I did some research, some time ago, about how to grow the trees on a
>> cluster using all the data when the dataset is too big to fit in
>> memory. Two papers got my attention:
>>
>> *RainForest - A Framework for Fast Decision Tree Construction of Large
>> Datasets [1]*
>> This paper describes how to grow the trees without loading all the
>> data in memory. Using this, you need to copy the dataset to all
>> computing nodes.
>>
>> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
>> Although it uses map-reduce to grow the trees, it uses lots of hacks
>> and tricks to make it work.
>>
>> It shouldn't be too difficult to use [1] to improve the current
>> implementation; this way the forest can be grown on the full, large
>> dataset on a computing cluster. One question, though: do we really
>> need Hadoop for this? I mean, as long as each node has access to the
>> dataset files, this should be enough to build the forest on any number
>> of nodes.
>>
>> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
>> [2] http://dejanseo.com.au/research/google/36296.pdf
>>
>>>> Another option is that all the nodes start some trees based on their
>>>> own data and spit the set of all such trees to a central (possibly
>>>> distributed) store. They also grab some number of trees back down
>>>> from the central store and try elaborating those trees using their
>>>> own data. These will also be sent back to the central store. This is
>>>> closer to random forests.
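[Editor's note: Ted's second option, a shared store that every node pushes its trees to and pulls other nodes' trees from for elaboration, could be simulated in-process roughly as below. The dict "store", the (owner, elaborator) keying, and refitting only leaf labels as the "elaboration" step are all simplifying assumptions for illustration.]

```python
# Toy single-process simulation of the central-store option: each node
# grows a stump skeleton on its own data and publishes it; every other
# node then pulls it and re-estimates ("elaborates") its leaf labels
# against local data, publishing the result back.
from collections import Counter

def leaf_labels(rows, f, t):
    """Majority class on each side of the split (feature f, threshold t)."""
    left = [y for x, y in rows if x[f] <= t]
    right = [y for x, y in rows if x[f] > t]
    maj = lambda ys, default: Counter(ys).most_common(1)[0][0] if ys else default
    return maj(left, 0), maj(right, 0)

def run_round(node_data, splits):
    """node_data: list of per-node datasets; splits: one (f, t) stump
    skeleton per node. Store keys are (owner, elaborator) pairs."""
    store = {}                                    # the shared "central store"
    for node, (f, t) in enumerate(splits):
        l, r = leaf_labels(node_data[node], f, t)
        store[(node, node)] = (f, t, l, r)        # each node's own version
    for node, rows in enumerate(node_data):       # pull + elaborate + push back
        for (owner, elab), (f, t, _, _) in list(store.items()):
            if owner == elab and owner != node:   # other nodes' originals only
                l, r = leaf_labels(rows, f, t)
                store[(owner, node)] = (f, t, l, r)
    return store
```

In a real deployment the store would be a distributed service and the elaboration step would grow new branches, not just refit leaves, but the push/pull flow is the same.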
>>>>
>>>> A third option is like option two, but the trees that get pulled
>>>> down and spit back up collect statistics until enough nodes report
>>>> such stats, at which point a new branch is added and the stats are
>>>> reset. This is very close to real random forests. It should also be
>>>> efficient because each node is passing many trees across its own
>>>> data in each pass. If that data fits in memory (even in file cache)
>>>> then the pass will be very fast. If it doesn't fit in memory, then
>>>> it will run as fast as the disk will let it run. Allowing a tree to
>>>> commit with only 90% coverage should allow good robustness against
>>>> laggards while doing almost exactly the same thing as the real
>>>> random forest algorithm.
>>>>
>>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]> wrote:
>>>>> Ted,
>>>>>
>>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>>> tree can be built efficiently in a distributed fashion?
>>>>>
>>>>> The following two ways seem like the naive ways of doing this:
>>>>> 1) Each machine constructs one node of the tree by scanning all the
>>>>> data, filtering out the records that don't reach its node (this is
>>>>> expensive), computing the split and then writing the split out.
>>>>> This requires a pass through the data for each node of the tree.
>>>>> 2) As 1), except that each machine writes out the filtered data
>>>>> after its node in the tree. This requires less scanning (as much as
>>>>> the local case I described earlier), but lots of data movement.
>>>>>
>>>>> Do you have an alternative method?
>>>>>
>>>>> Andy
>>>>>
>>>>> On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
>>>>>> If we have a step which permutes the data (once) then I doubt that
>>>>>> redistribution is necessary. At that point the randomness consists
>>>>>> of building trees based on different variable subsets and data
>>>>>> subsets.
>>>>>> The original random forests only split on variable subsets. How
>>>>>> much this matters is an open question.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]> wrote:
>>>>>>> It sounds OK to me. Yes, I don't think you want to try to
>>>>>>> parallelize the building of each tree just to build it off all
>>>>>>> the data. It will be too slow.
>>>>>>>
>>>>>>> I imagine the game is to parallelize such that each worker has
>>>>>>> 1/Nth of the data, where N is as low as you can manage. Each
>>>>>>> worker is building one or more trees from its slice. Then
>>>>>>> iterate, and split up the data differently. Each worker can now
>>>>>>> cross-validate previous trees based on the new slice, or even
>>>>>>> build new ones, and iterate again.
>>>>>>>
>>>>>>> I'm guessing this process will tend to need to build more trees
>>>>>>> than usual. I think it's also possible that it will want to try
>>>>>>> to discard trees with high out-of-bag error.
>>>>>>>
>>>>>>> Just riffing. I think this is some potential green-field
>>>>>>> territory to really think about how you build trees when the data
>>>>>>> doesn't nearly fit in memory, and what shape of computation gives
>>>>>>> a really nice tradeoff between speed and accuracy. I don't think
>>>>>>> the conventional approach is that point; it's more like the
>>>>>>> above.
>>>>>>>
>>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]> wrote:
>>>>>>>> Here's my thought for what might work well. Comments very
>>>>>>>> welcome.
>>>>>>>>
>>>>>>>> We construct k decision trees (where k might be 20-100). Each
>>>>>>>> tree has depth roughly 10-30. Rather than construct each tree in
>>>>>>>> a distributed fashion, construct each tree locally by using
>>>>>>>> multiple passes over the bagged sample for that tree (construct
>>>>>>>> in BFS manner, assuming the tree can fit in memory). Do this in
>>>>>>>> parallel k times.
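[Editor's note: the per-tree, multi-pass BFS construction Andy describes, one streaming pass over the bagged sample per tree level while only the tree itself stays in memory, might look roughly like this sketch. The split criterion and all names are illustrative toys, not Mahout code.]

```python
# Grow one tree level by level (BFS). Each level makes a single pass
# over the data, bucketing every row by the leaf it currently reaches,
# so the data could just as well be streamed from disk.
from collections import Counter

def route(tree, x):
    """Follow split nodes down to a leaf id (an id with no split entry)."""
    node = 0
    while node in tree:               # tree: node_id -> (f, t, left_id, right_id)
        f, t, l, r = tree[node]
        node = l if x[f] <= t else r
    return node

def best_split(rows):
    """Toy criterion: maximize the summed per-side majority counts."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            if not left or not right:
                continue
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t)
    return best

def grow_bfs(rows, max_depth):
    tree, leaves, next_id = {}, {0}, 1
    for _ in range(max_depth):
        buckets = {}                  # ONE pass over the data per level
        for x, y in rows:
            buckets.setdefault(route(tree, x), []).append((x, y))
        grew = False
        for leaf in list(leaves):
            bucket = buckets.get(leaf, [])
            if len(set(y for _, y in bucket)) < 2:
                continue              # pure or empty leaf: nothing to split
            split = best_split(bucket)
            if split is None:
                continue
            _, f, t = split
            tree[leaf] = (f, t, next_id, next_id + 1)
            leaves.discard(leaf)
            leaves.update({next_id, next_id + 1})
            next_id += 2
            grew = True
        if not grew:
            break
    labels = {}                       # final pass: majority label per leaf
    for x, y in rows:
        labels.setdefault(route(tree, x), []).append(y)
    return tree, {leaf: Counter(ys).most_common(1)[0][0] for leaf, ys in labels.items()}

def predict(tree, labels, x):
    return labels[route(tree, x)]
```

Doing this k times in parallel, once per bagged sample, gives the forest without any per-level shuffle of the data.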
>>>>>>>> I know this is perhaps not the "mapreduce" way, but I'd be
>>>>>>>> interested to hear alternatives. It seems a bad idea to shuffle
>>>>>>>> the data around on every BFS iteration per tree.
>>>>>>>>
>>>>>>>> I'm not planning to work on this right now, but would like to
>>>>>>>> figure out if Mahout is a good platform to work with if I want
>>>>>>>> such an algorithm.
>>>>>>>>
>>>>>>>> Andy
>>>>>
>>>>> --
>>>>> Dr Andy Twigg
>>>>> Junior Research Fellow, St Johns College, Oxford
>>>>> Room 351, Department of Computer Science
>>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>>> [email protected] | +447799647538

--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
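[Editor's note: the memory-mapped forest representation Marty mentions at the top of the thread can be sketched as below: the forest is serialized as fixed-width node records in a flat file, and classification walks the mapping directly, so the per-process heap cost is essentially just the mapping. The record layout is invented for illustration and is not Mahout's format.]

```python
# Serialize a tree as fixed-width records, then classify straight off
# an mmap of the file without deserializing the forest into the heap.
import mmap, os, struct, tempfile

# Node record: feature (-1 marks a leaf), threshold, left child index,
# right child index, class label. "<idiii" = 24 bytes, no padding.
NODE = struct.Struct("<idiii")

def write_tree(f, nodes):
    """nodes: list of (feature, threshold, left, right, label) tuples;
    child indices refer back into the same list, node 0 is the root."""
    for n in nodes:
        f.write(NODE.pack(*n))

def classify(mm, base, x):
    """Walk one tree stored at byte offset `base` of the mapped file."""
    i = 0
    while True:
        feature, threshold, left, right, label = NODE.unpack_from(mm, base + i * NODE.size)
        if feature < 0:
            return label
        i = left if x[feature] <= threshold else right

# Usage: a one-stump "forest" (x[0] <= 0.5 -> class 0, else class 1).
nodes = [(0, 0.5, 1, 2, -1), (-1, 0.0, 0, 0, 0), (-1, 0.0, 0, 0, 1)]
with tempfile.NamedTemporaryFile(delete=False) as f:
    write_tree(f, nodes)
    path = f.name
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    preds = [classify(mm, 0, (v,)) for v in (0.2, 0.8)]
    mm.close()
os.remove(path)
print(preds)  # -> [0, 1]
```

Because the OS pages the file in on demand and can share the pages across processes, many JVMs (or here, Python processes) can classify against one large forest without each paying its full RAM cost.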
