Memory mapping would help, but a proper scalable random forest implementation should be the goal imo, especially given the claims of Mahout.
I am having a hard time convincing myself that doing it on Hadoop is the
best idea (and like I said, it's not like there are other libraries that
would make life much easier).

Andy

On 15 February 2013 13:47, Marty Kube <[email protected]> wrote:
> The hurdle I faced on my current project is that we had many random
> forests and the RAM requirements per JVM during classification were too
> big, so we could not use Mahout.
>
> We went to a memory-mapped forest representation, which worked out
> nicely. Is that a feature Mahout could use?
>
> On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
>> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <[email protected]> wrote:
>>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>>> I think I was suggesting something weaker.
>>>>
>>>> I was suggesting that trees get built against a portion of the data
>>>> and each node builds some number of trees against just the data it
>>>> sees. This is in the spirit of random forests, but not the letter.
>>>
>>> I'm looking at the Mahout partial implementation. It appears to me
>>> that some number of trees get built against the portion of the data
>>> each node sees, based on the map-reduce input split.
>>> Am I wrong here? If not, Mahout already has an out-of-core
>>> implementation.
>>
>> Indeed, each mapper will grow a subset of the forest using only the
>> data passed to it. Actually, this was based on a suggestion by Ted:
>>
>> "modify the original algorithm to build multiple trees for different
>> portions of the data. That loses some of the solidity of the original
>> method, but could actually do better if the splits exposed
>> non-stationary behavior."
>>
>>> BTW - where did the name Partial Implementation come from? Is this
>>> partially completed work :-)
>>
>> Ha ha ha, the name came from the fact that each mapper/node has a
>> "partial" view of the dataset.
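[Editor's note: the "partial" scheme described above, where each mapper grows a few trees using only its own input split and the final forest is just the union, can be sketched as follows. The stump learner and all names here are illustrative toys, not Mahout's actual API.]

```python
# Toy sketch of the "partial" scheme: one "mapper" per data split, each
# growing its trees on its split only; the forest is the union of all
# trees and classification is a majority vote.
from collections import Counter

def train_stump(rows):
    """Grow a depth-1 'tree': pick the (feature, threshold) pair whose
    per-side majority classes explain the most rows of this split."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            if not left or not right:
                continue
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    _, f, t, l, r = best
    return lambda x: l if x[f] <= t else r

def grow_partial_forest(splits, trees_per_split=3):
    """Each 'mapper' sees only its own split, as in the partial implementation."""
    return [train_stump(split) for split in splits for _ in range(trees_per_split)]

def classify(forest, x):
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```

Each stump here is grown deterministically to keep the sketch short; a real implementation would also bag rows and sample features per tree.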
>> But I do agree that it's partially completed :P I guess there was
>> never enough user interest in Mahout's Decision Forest implementation
>> to motivate me to keep working on it, but I do see more and more
>> questions on the mailing lists about this, so maybe I will get
>> motivated enough to work on it again.
>>
>> I did some research, some time ago, about how to grow the trees on a
>> cluster using all the data when the dataset is too big to fit in
>> memory. Two papers got my attention:
>>
>> *RainForest - A Framework for Fast Decision Tree Construction of Large
>> Datasets [1]*
>> This paper describes how to grow the trees without loading all the
>> data in memory. Using this, you need to copy the dataset to all
>> computing nodes.
>>
>> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
>> Although it uses map-reduce to grow the trees, it uses lots of hacks
>> and tricks to make it work.
>>
>> It shouldn't be too difficult to use [1] to improve the current
>> implementation; this way the forest can be grown on the full, large
>> dataset on a computing cluster. One question, though: do we really
>> need Hadoop for this? I mean, as long as each node has access to the
>> dataset files, this should be enough to build the forest on any number
>> of nodes.
>>
>> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
>> [2] http://dejanseo.com.au/research/google/36296.pdf
>>
>>>> Another option is that all the nodes start some trees based on their
>>>> own data and spit the set of all such trees to a central (possibly
>>>> distributed) store. They also grab some number of trees back down
>>>> from the central store and try elaborating those trees using their
>>>> own data. These will also be sent back to the central store. This is
>>>> closer to random forests.
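[Editor's note: Ted's second option, a shared store that every node pushes its trees to and pulls other nodes' trees from for elaboration, could be simulated in-process roughly as below. The dict "store", the (owner, elaborator) keying, and refitting only leaf labels as the "elaboration" step are all simplifying assumptions for illustration.]

```python
# Toy single-process simulation of the central-store option: each node
# grows a stump skeleton on its own data and publishes it; every other
# node then pulls it and re-estimates ("elaborates") its leaf labels
# against local data, publishing the result back.
from collections import Counter

def leaf_labels(rows, f, t):
    """Majority class on each side of the split (feature f, threshold t)."""
    left = [y for x, y in rows if x[f] <= t]
    right = [y for x, y in rows if x[f] > t]
    maj = lambda ys, default: Counter(ys).most_common(1)[0][0] if ys else default
    return maj(left, 0), maj(right, 0)

def run_round(node_data, splits):
    """node_data: list of per-node datasets; splits: one (f, t) stump
    skeleton per node. Store keys are (owner, elaborator) pairs."""
    store = {}                                    # the shared "central store"
    for node, (f, t) in enumerate(splits):
        l, r = leaf_labels(node_data[node], f, t)
        store[(node, node)] = (f, t, l, r)        # each node's own version
    for node, rows in enumerate(node_data):       # pull + elaborate + push back
        for (owner, elab), (f, t, _, _) in list(store.items()):
            if owner == elab and owner != node:   # other nodes' originals only
                l, r = leaf_labels(rows, f, t)
                store[(owner, node)] = (f, t, l, r)
    return store
```

In a real deployment the store would be a distributed service and the elaboration step would grow new branches, not just refit leaves, but the push/pull flow is the same.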
>>>>
>>>> A third option is like option two, but the trees that get pulled
>>>> down and spit back up collect statistics until enough nodes report
>>>> such stats, at which point a new branch is added and the stats are
>>>> reset. This is very close to real random forests. It should also be
>>>> efficient because each node is passing many trees across its own
>>>> data in each pass. If that data fits in memory (even in file cache)
>>>> then the pass will be very fast. If it doesn't fit in memory, then
>>>> it will run as fast as the disk will let it run. Allowing a tree to
>>>> commit with only 90% coverage should allow good robustness against
>>>> laggards while doing almost exactly the same thing as the real
>>>> random forest algorithm.
>>>>
>>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]> wrote:
>>>>> Ted,
>>>>>
>>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>>> tree can be built efficiently in a distributed fashion?
>>>>>
>>>>> The following two ways seem like the naive ways of doing this:
>>>>> 1) Each machine constructs one node of the tree by scanning all the
>>>>> data, filtering out the records that don't reach its node (this is
>>>>> expensive), computing the split and then writing the split out.
>>>>> This requires a pass through the data for each node of the tree.
>>>>> 2) As 1), except that each machine writes out the filtered data
>>>>> after its node in the tree. This requires less scanning (as much as
>>>>> the local case I described earlier), but lots of data movement.
>>>>>
>>>>> Do you have an alternative method?
>>>>>
>>>>> Andy
>>>>>
>>>>> On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
>>>>>> If we have a step which permutes the data (once) then I doubt that
>>>>>> redistribution is necessary. At that point the randomness consists
>>>>>> of building trees based on different variable subsets and data
>>>>>> subsets.
>>>>>> The original random forests only split on variable subsets. How
>>>>>> much this matters is an open question.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]> wrote:
>>>>>>> It sounds OK to me. Yes, I don't think you want to try to
>>>>>>> parallelize the building of each tree just to build it off all
>>>>>>> the data. It will be too slow.
>>>>>>>
>>>>>>> I imagine the game is to parallelize such that each worker has
>>>>>>> 1/Nth of the data, where N is as low as you can manage. Each
>>>>>>> worker is building one or more trees from its slice. Then
>>>>>>> iterate, and split up the data differently. Each worker can now
>>>>>>> cross-validate previous trees based on the new slice, or even
>>>>>>> build new ones, and iterate again.
>>>>>>>
>>>>>>> I'm guessing this process will tend to need to build more trees
>>>>>>> than usual. I think it's also possible that it will want to try
>>>>>>> to discard trees with high out-of-bag error.
>>>>>>>
>>>>>>> Just riffing. I think this is some potential green-field
>>>>>>> territory to really think about how you build trees when the data
>>>>>>> doesn't nearly fit in memory, and what shape of computation gives
>>>>>>> a really nice tradeoff between speed and accuracy. I don't think
>>>>>>> the conventional approach is that point; it's more like the
>>>>>>> above.
>>>>>>>
>>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]> wrote:
>>>>>>>> Here's my thought for what might work well. Comments very
>>>>>>>> welcome.
>>>>>>>>
>>>>>>>> We construct k decision trees (where k might be 20-100). Each
>>>>>>>> tree has depth roughly 10-30. Rather than construct each tree in
>>>>>>>> a distributed fashion, construct each tree locally by using
>>>>>>>> multiple passes over the bagged sample for that tree (construct
>>>>>>>> in BFS manner, assuming the tree can fit in memory). Do this in
>>>>>>>> parallel k times.
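[Editor's note: the per-tree, multi-pass BFS construction Andy describes, one streaming pass over the bagged sample per tree level while only the tree itself stays in memory, might look roughly like this sketch. The split criterion and all names are illustrative toys, not Mahout code.]

```python
# Grow one tree level by level (BFS). Each level makes a single pass
# over the data, bucketing every row by the leaf it currently reaches,
# so the data could just as well be streamed from disk.
from collections import Counter

def route(tree, x):
    """Follow split nodes down to a leaf id (an id with no split entry)."""
    node = 0
    while node in tree:               # tree: node_id -> (f, t, left_id, right_id)
        f, t, l, r = tree[node]
        node = l if x[f] <= t else r
    return node

def best_split(rows):
    """Toy criterion: maximize the summed per-side majority counts."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            if not left or not right:
                continue
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t)
    return best

def grow_bfs(rows, max_depth):
    tree, leaves, next_id = {}, {0}, 1
    for _ in range(max_depth):
        buckets = {}                  # ONE pass over the data per level
        for x, y in rows:
            buckets.setdefault(route(tree, x), []).append((x, y))
        grew = False
        for leaf in list(leaves):
            bucket = buckets.get(leaf, [])
            if len(set(y for _, y in bucket)) < 2:
                continue              # pure or empty leaf: nothing to split
            split = best_split(bucket)
            if split is None:
                continue
            _, f, t = split
            tree[leaf] = (f, t, next_id, next_id + 1)
            leaves.discard(leaf)
            leaves.update({next_id, next_id + 1})
            next_id += 2
            grew = True
        if not grew:
            break
    labels = {}                       # final pass: majority label per leaf
    for x, y in rows:
        labels.setdefault(route(tree, x), []).append(y)
    return tree, {leaf: Counter(ys).most_common(1)[0][0] for leaf, ys in labels.items()}

def predict(tree, labels, x):
    return labels[route(tree, x)]
```

Doing this k times in parallel, once per bagged sample, gives the forest without any per-level shuffle of the data.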
>>>>>>>> I know this is perhaps not the "mapreduce" way, but I'd be
>>>>>>>> interested to hear alternatives. It seems a bad idea to shuffle
>>>>>>>> the data around on every BFS iteration per tree.
>>>>>>>>
>>>>>>>> I'm not planning to work on this right now, but would like to
>>>>>>>> figure out if Mahout is a good platform to work with if I want
>>>>>>>> such an algorithm.
>>>>>>>>
>>>>>>>> Andy
>>>>>
>>>>> --
>>>>> Dr Andy Twigg
>>>>> Junior Research Fellow, St Johns College, Oxford
>>>>> Room 351, Department of Computer Science
>>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>>> [email protected] | +447799647538

--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
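[Editor's note: the memory-mapped forest representation Marty mentions at the top of the thread can be sketched as below: the forest is serialized as fixed-width node records in a flat file, and classification walks the mapping directly, so the per-process heap cost is essentially just the mapping. The record layout is invented for illustration and is not Mahout's format.]

```python
# Serialize a tree as fixed-width records, then classify straight off
# an mmap of the file without deserializing the forest into the heap.
import mmap, os, struct, tempfile

# Node record: feature (-1 marks a leaf), threshold, left child index,
# right child index, class label. "<idiii" = 24 bytes, no padding.
NODE = struct.Struct("<idiii")

def write_tree(f, nodes):
    """nodes: list of (feature, threshold, left, right, label) tuples;
    child indices refer back into the same list, node 0 is the root."""
    for n in nodes:
        f.write(NODE.pack(*n))

def classify(mm, base, x):
    """Walk one tree stored at byte offset `base` of the mapped file."""
    i = 0
    while True:
        feature, threshold, left, right, label = NODE.unpack_from(mm, base + i * NODE.size)
        if feature < 0:
            return label
        i = left if x[feature] <= threshold else right

# Usage: a one-stump "forest" (x[0] <= 0.5 -> class 0, else class 1).
nodes = [(0, 0.5, 1, 2, -1), (-1, 0.0, 0, 0, 0), (-1, 0.0, 0, 0, 1)]
with tempfile.NamedTemporaryFile(delete=False) as f:
    write_tree(f, nodes)
    path = f.name
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    preds = [classify(mm, 0, (v,)) for v in (0.2, 0.8)]
    mm.close()
os.remove(path)
print(preds)  # -> [0, 1]
```

Because the OS pages the file in on demand and can share the pages across processes, many JVMs (or here, Python processes) can classify against one large forest without each paying its full RAM cost.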
