Weighing in late here -- it's an interesting question whether M/R is a good fit for iterative processes. Today you can already optimize away most of the startup overhead for a job by setting Hadoop to reuse JVMs, increasing the heartbeat rate, etc. I know Ted can add more about how MapR tunes that further. So I'm not sure it's the overhead of starting jobs per se.
Unless your iterations are < 1 minute or something -- and they may be in some situations. I don't think that's true of, say, ALS, or even the RF implementation I have in mind. Iterations may be really fast at small scale too, but that's not what Hadoop is for. Or unless your iterations have significant init overhead, like loading data; that's a good point too.

I think the problem comes if the process is not naturally iterative -- if it's parallel, but the workers need not stop to sync up, then forcing them into an iterative process just wastes time. Most of the time you're waiting for a straggler worker, needlessly. In this regard, I'm not sure that a BSP paradigm is any better -- though I'm not sure anyone was pushing BSP.

But I think RF could fit an iterative scheme well. Caveats: a) I haven't thought it through completely or tried it, and b) I can imagine reasons it may work in theory but not in practice. But is this not roughly how you'd do it -- maybe this is what's already being suggested? Roughly:

- Break up the data by feature. (Subdivide features if needed.) Map over all the data and distribute the single-feature data.
- Reducers compute an optimal split / decision rule and output their rules.
- Next iteration: mappers read the rules and choose a next random feature for each rule. They map the data, apply each rule to each record, apply the next choice of feature, and output the single feature again.
- Reducers receive, in one input group, again all the data they need to compute another split. They output that split.
- Repeat.

There's a lot of devil in the details there, but this seems like a fine way to build a tree level by level without sending all the data all over the place. I suppose the problem is uneven data distribution, and that some decision trees will go deep. But your number of input groups quickly gets quite large as the groups get smaller, so it ought to even out over reducers (?).

To answer Marty: I don't think this project will ever change much from what it is now.
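To make the scheme above concrete, here's a hypothetical in-memory sketch -- plain Python standing in for Hadoop, not real Mahout/Hadoop code. All names (map_phase, reduce_phase, node_of, feature_for_node) are made up for illustration; records are (feature-dict, 0/1-label) pairs, and the "split" is just the single-feature threshold with the fewest misclassifications:

```python
from collections import defaultdict

def map_phase(records, feature_for_node, node_of):
    """Mapper: route each record's single-feature value to the
    (node, feature) input group that will compute that node's next split."""
    groups = defaultdict(list)
    for i, (features, label) in enumerate(records):
        node = node_of[i]              # which tree node this record has reached
        feat = feature_for_node[node]  # the random feature chosen for that node
        groups[(node, feat)].append((features[feat], label))
    return groups

def reduce_phase(groups):
    """Reducer: each input group holds all the data it needs; pick the
    threshold minimizing misclassifications (either orientation)."""
    rules = {}
    for (node, feat), values in groups.items():
        candidates = sorted({v for v, _ in values})
        best = min(
            candidates,
            key=lambda t: min(
                sum((v <= t) != lbl for v, lbl in values),  # predict 1 on the left
                sum((v > t) != lbl for v, lbl in values),   # predict 1 on the right
            ),
        )
        rules[node] = (feat, best)  # the emitted decision rule for this node
    return rules
```

The next iteration would feed these rules back to the mappers, which route each record to a child node and a fresh random feature -- the devil-in-the-details part (group sizes, deep trees) is exactly what this sketch glosses over.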
It's not even properly on Hadoop 0.20.x, much less 2.x. An MRv2-based project is a different project, as it would properly be a total rewrite. Something to start thinking about -- and to start drawing a line under what's here, IMHO.

On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <[email protected]> wrote:
> What about using one map reduce job per iteration? The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model. We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
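For what it's worth, Marty's one-job-per-iteration idea reduces to a driver loop like the following hypothetical sketch -- plain Python in place of actual Hadoop job submission, with `model` standing in for the file you'd ship via the distributed cache, and `run_job` an assumed callback representing one full MR job:

```python
def run_iterations(records, initial_model, run_job, max_iters=10):
    """Submit one (simulated) MapReduce job per iteration. Each job reads
    the previous round's model, as it would from the distributed cache,
    and its reducer emits the expanded model."""
    model = initial_model
    for _ in range(max_iters):
        new_model = run_job(records, model)  # one full MR job per iteration
        if new_model == model:               # reducer added nothing: converged
            break
        model = new_model
    return model
```

As Marty says, with large data sets the per-job startup latency in a loop like this is unlikely to dominate, especially with JVM reuse enabled.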
