Weighing in late here -- it's an interesting question whether M/R is a good fit for iterative processes. Today you can already optimize away most of the startup overhead for a job by setting Hadoop to reuse JVMs, increasing the heartbeat rate, etc. I know Ted can add more about how MapR tunes that further. So I'm not sure it's the overhead of starting jobs per se.
Unless your iterations are < 1 minute or something -- and they may be in some situations. I don't think that's true of, say, ALS, or even the RF implementation I have in mind. Iterations may be really fast at small scale too, but that's not what Hadoop is for. Or unless your iterations have significant init overhead, like loading data; that's a good point too.

I think the problem comes if the process is not naturally iterative -- if it's parallel, but the workers need not stop to sync up, then forcing them into an iterative process just wastes time. Most of the time you're waiting for a straggler worker, needlessly. In this regard, I'm not sure that a BSP paradigm is any better -- though I'm not sure anyone was pushing BSP.

But I think RF could fit an iterative scheme well. Caveats: a) I haven't thought it through completely or tried it, and b) I can imagine reasons it may work in theory but not in practice. But is this not roughly how you'd do it -- maybe this is what's already being suggested? Roughly:

- Break up the data by feature. (Subdivide features if needed.) Map over all the data and distribute the single-feature data.
- Reducers compute an optimal split / decision rule and output their rules.
- Next iteration: mappers read the rules and choose a next random feature for each rule. They map the data, apply each rule to each record, apply the next choice of feature, and output the single feature again.
- Reducers receive, in one input group, again all the data they need to compute another split. They output that split.
- Repeat.

There's a lot of devil in the details there, but this seems like a fine way to build a tree level by level without sending all the data all over the place. I suppose the problem is uneven data distribution, and that some decision trees will go deep. But your number of input groups quickly gets quite large as the groups get smaller, so it ought to even out over reducers (?).

To answer Marty: I don't think this project will ever change much from what it is now.
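To make the scheme above concrete, here's a hypothetical in-memory sketch -- plain Python standing in for Hadoop, not real Mahout/Hadoop code. All names (map_phase, reduce_phase, node_of, feature_for_node) are made up for illustration; records are (feature-dict, 0/1-label) pairs, and the "split" is just the single-feature threshold with the fewest misclassifications:

```python
from collections import defaultdict

def map_phase(records, feature_for_node, node_of):
    """Mapper: route each record's single-feature value to the
    (node, feature) input group that will compute that node's next split."""
    groups = defaultdict(list)
    for i, (features, label) in enumerate(records):
        node = node_of[i]              # which tree node this record has reached
        feat = feature_for_node[node]  # the random feature chosen for that node
        groups[(node, feat)].append((features[feat], label))
    return groups

def reduce_phase(groups):
    """Reducer: each input group holds all the data it needs; pick the
    threshold minimizing misclassifications (either orientation)."""
    rules = {}
    for (node, feat), values in groups.items():
        candidates = sorted({v for v, _ in values})
        best = min(
            candidates,
            key=lambda t: min(
                sum((v <= t) != lbl for v, lbl in values),  # predict 1 on the left
                sum((v > t) != lbl for v, lbl in values),   # predict 1 on the right
            ),
        )
        rules[node] = (feat, best)  # the emitted decision rule for this node
    return rules
```

The next iteration would feed these rules back to the mappers, which route each record to a child node and a fresh random feature -- the devil-in-the-details part (group sizes, deep trees) is exactly what this sketch glosses over.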
It's not even properly on Hadoop 0.20.x, much less 2.x. An MRv2-based project is a different project, as it would properly be a total rewrite. Something to start thinking about -- and to start drawing a line under what's here, IMHO.

On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <[email protected]> wrote:
> What about using one map reduce job per iteration? The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model. We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
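For what it's worth, Marty's one-job-per-iteration idea reduces to a driver loop like the following hypothetical sketch -- plain Python in place of actual Hadoop job submission, with `model` standing in for the file you'd ship via the distributed cache, and `run_job` an assumed callback representing one full MR job:

```python
def run_iterations(records, initial_model, run_job, max_iters=10):
    """Submit one (simulated) MapReduce job per iteration. Each job reads
    the previous round's model, as it would from the distributed cache,
    and its reducer emits the expanded model."""
    model = initial_model
    for _ in range(max_iters):
        new_model = run_job(records, model)  # one full MR job per iteration
        if new_model == model:               # reducer added nothing: converged
            break
        model = new_model
    return model
```

As Marty says, with large data sets the per-job startup latency in a loop like this is unlikely to dominate, especially with JVM reuse enabled.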
