That sounds OK to me. Yes, I don't think you want to parallelize the building of each tree just so that each tree can be built from all of the data; it will be too slow.
I imagine the game is to parallelize such that each worker has 1/Nth of the data, where N is as low as you can manage. Each worker builds one or more trees from its slice. Then iterate, splitting up the data differently: each worker can now cross-validate the previous trees against its new slice, or even build new ones, and iterate again.

I'm guessing this process will tend to need more trees than usual, and it may also want to discard trees with high out-of-bag error. Just riffing. I think this is some potential green-field territory: how do you build trees when the data doesn't nearly fit in memory, and what shape of computation gives a really nice tradeoff between speed and accuracy? I don't think the conventional approach is that point; I think it's more like the above.

On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]> wrote:
> Here's my thought for what might work well. Comments very welcome.
>
> We construct k decision trees (where k might be 20-100). Each tree has
> depth roughly 10-30. Rather than construct each tree in a distributed
> fashion, construct each tree locally by using multiple passes over the
> bagged sample for that tree (construct in BFS manner, assuming the
> tree can fit in memory). Do this in parallel k times. I know this is
> perhaps not the "mapreduce" way, but I'd be interested to hear
> alternatives. It seems a bad idea to shuffle the data around on every
> BFS iteration per tree.
>
> I'm not planning to work on this right now, but would like to figure
> out if Mahout is a good platform to work with if I want such an
> algorithm.
>
> Andy
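P.S. Roughly, the loop I had in mind above might look like the sketch below. To be clear, this is just an illustration of the orchestration, not a real implementation: a trivial one-feature "stump" stands in for a proper tree learner, in-process lists stand in for distributed workers and their slices, and all of the names (train_tree, iterative_forest, max_err, etc.) are made up for this sketch.

```python
import random

def train_tree(slice_rows):
    """Placeholder learner: a one-feature threshold 'stump'.
    A real worker would grow a full decision tree from its slice here."""
    best = (1.0, 0, 0.0)  # (error, feature index, threshold)
    n_features = len(slice_rows[0][0])
    for f in range(n_features):
        for x, _ in slice_rows:
            t = x[f]
            err = sum((xi[f] > t) != yi for xi, yi in slice_rows) / len(slice_rows)
            if err < best[0]:
                best = (err, f, t)
    _, f, t = best
    return lambda x: int(x[f] > t)

def error(tree, rows):
    """Misclassification rate of one tree on a set of rows."""
    return sum(tree(x) != y for x, y in rows) / len(rows)

def iterative_forest(data, n_slices=4, n_rounds=3, max_err=0.4):
    """Each round: re-split the data, let each 'worker' build a tree
    from its slice, and cross-validate trees from earlier rounds on the
    freshly re-partitioned data, discarding those with high held-out
    error (the analogue of the out-of-bag filtering suggested above).
    Note: shuffles `data` in place."""
    forest = []
    for _ in range(n_rounds):
        random.shuffle(data)
        slices = [data[i::n_slices] for i in range(n_slices)]
        new_trees = [train_tree(s) for s in slices]
        # Trees from previous rounds were built on a different split, so
        # this round's first slice acts as (approximately) held-out data.
        forest = [t for t in forest if error(t, slices[0]) <= max_err]
        forest.extend(new_trees)
    return forest

def predict(forest, x):
    """Majority vote over the surviving trees (ties break toward 1)."""
    votes = sum(t(x) for t in forest)
    return int(votes * 2 >= len(forest))
```

The point of the sketch is only the shape of the computation: train locally per slice, reshuffle, validate old trees on data they didn't see, prune, repeat.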
