Sounds OK to me. Yes, I don't think you want to parallelize the
building of each tree just so it can be built from all the data. That
will be too slow.

I imagine the game is to parallelize such that each worker has 1/Nth
of the data, where N is as low as you can manage. Each worker is
building one or more trees from its slice. Then iterate, and split up
the data differently. Each worker can now cross-validate previous
trees based on the new slice, or even build new ones, and iterate
again.

I'm guessing this process will tend to need to build more trees than
usual. I think it's also possible that it will want to try to discard
trees with high out-of-bag error.
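To make the idea concrete, here's a single-process sketch of that loop: workers each get a 1/Nth slice, each grows a tree from its slice, then the data is re-split and earlier trees are validated against a fresh slice and discarded if their held-out error is too high. Everything here is illustrative, not a proposed implementation — the "trees" are one-feature decision stumps to keep it short, and the names (`N`, `ROUNDS`, `DISCARD_ABOVE`) are made up.

```python
import random

random.seed(0)

# Synthetic binary data: label is 1 exactly when the first feature exceeds 0.5.
data = []
for _ in range(400):
    x = [random.random(), random.random()]
    data.append((x, int(x[0] > 0.5)))

def stump_predict(stump, x):
    f, t, flip = stump
    p = int(x[f] > t)
    return 1 - p if flip else p

def stump_error(stump, rows):
    return sum(stump_predict(stump, x) != y for x, y in rows) / len(rows)

def fit_stump(rows):
    """Stand-in for per-worker tree building: best (feature, threshold, polarity)."""
    best, best_err = None, 1.0
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            for flip in (False, True):
                err = stump_error((f, t, flip), rows)
                if err < best_err:
                    best, best_err = (f, t, flip), err
    return best

N = 4              # "workers" per round (illustrative)
ROUNDS = 3         # re-split the data this many times
DISCARD_ABOVE = 0.35  # hypothetical held-out error cutoff

forest = []
for _ in range(ROUNDS):
    random.shuffle(data)                      # split the data up differently
    size = len(data) // N
    slices = [data[i * size:(i + 1) * size] for i in range(N)]
    # Cross-validate previous trees against a fresh slice; drop the bad ones.
    forest = [s for s in forest if stump_error(s, slices[0]) <= DISCARD_ABOVE]
    # Each "worker" builds a new tree from its own slice.
    forest += [fit_stump(s) for s in slices]

def vote(x):
    return int(sum(stump_predict(s, x) for s in forest) > len(forest) / 2)

acc = sum(vote(x) == y for x, y in data) / len(data)
print(f"trees kept: {len(forest)}, ensemble accuracy: {acc:.2f}")
```

This does suggest the process builds more trees than usual, as above — each round adds N trees and only prunes the ones that generalize badly to the new slice.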

Just riffing. I think this is potential green-field territory: really
thinking about how you build trees when the data doesn't come close to
fitting in memory, and what shape of computation gives a really nice
tradeoff between speed and accuracy. I don't think the conventional
approach hits that point; something like the above seems closer.

On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]> wrote:
> Here's my thought for what might work well. Comments very welcome.
>
> We construct k decision trees (where k might be 20-100). Each tree has
> depth roughly 10-30. Rather than construct each tree in a distributed
> fashion, construct each tree locally by using multiple passes over the
> bagged sample for that tree (construct in BFS manner, assuming the
> tree can fit in memory). Do this in parallel k times. I know this is
> perhaps not the "mapreduce" way, but I'd be interested to hear
> alternatives. It seems a bad idea to shuffle the data around on every
> BFS iteration per tree.
>
> I'm not planning to work on this right now, but would like to figure
> out if Mahout is a good platform to work with if I want such an
> algorithm.
>
> Andy
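For reference, the per-tree construction Andy describes — grow each tree level by level, one pass over its bagged sample per BFS level, with the tree itself held in memory — can be sketched roughly like this. This is a hypothetical single-machine illustration, not Mahout code; the depth, min-leaf-size, and split-search details are all made up for brevity.

```python
import random

random.seed(1)

# Synthetic binary data with a diagonal boundary, so depth actually matters.
data = []
for _ in range(400):
    x = [random.random(), random.random()]
    data.append((x, int(x[0] + x[1] > 1.0)))

class Node:
    def __init__(self, pred):
        self.pred = pred        # majority label at this node
        self.split = None       # (feature, threshold) once split
        self.left = self.right = None

def route(root, x):
    n = root
    while n.split:
        f, t = n.split
        n = n.left if x[f] <= t else n.right
    return n

def gini(rows):
    if not rows:
        return 0.0
    p = sum(y for _, y in rows) / len(rows)
    return 2 * p * (1 - p)

def majority(rows):
    return int(2 * sum(y for _, y in rows) > len(rows))

def best_split(rows):
    base = gini(rows)
    best, best_gain = None, 1e-9
    for f in range(2):
        for x, _ in rows:
            t = x[f]
            left = [r for r in rows if r[0][f] <= t]
            right = [r for r in rows if r[0][f] > t]
            if not left or not right:
                continue
            gain = base - (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if gain > best_gain:
                best, best_gain = (f, t), gain
    return best

root = Node(majority(data))
frontier = [root]
for _ in range(3):  # one pass over the sample per BFS level
    buckets = {id(n): [] for n in frontier}
    for x, y in data:  # the single sequential pass for this level
        leaf = route(root, x)
        if id(leaf) in buckets:
            buckets[id(leaf)].append((x, y))
    nxt = []
    for n in frontier:
        rows = buckets[id(n)]
        if len(rows) < 10:
            continue
        s = best_split(rows)
        if s is None:
            continue
        f, t = s
        n.split = s
        n.left = Node(majority([r for r in rows if r[0][f] <= t]))
        n.right = Node(majority([r for r in rows if r[0][f] > t]))
        nxt += [n.left, n.right]
    frontier = nxt
    if not frontier:
        break

acc = sum(route(root, x).pred == y for x, y in data) / len(data)
print(f"depth-3 tree accuracy: {acc:.2f}")
```

The point of the shape is that each level needs exactly one sequential scan of the tree's bagged sample, so k such builders can run in parallel without shuffling data between workers on every iteration.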
