> At Yahoo, we had a framework that was similar to MapReduce called
> Dreadnaught. When we were converting applications off of Dreadnaught
> to Hadoop MapReduce, we considered supporting M-R-R. (Dreadnaught
> imposes few restrictions on the application and could support M, M-R,
> M-R-R, etc.)
That actually sounds great. Not open source, I assume? :)

> The problem is that supporting the retry semantics
> arbitrarily far back can cause a single node failure to launch more
> and more work. By putting a checkpoint after each reduce (based on the
> replica count in HDFS > 1), M-R has bounded amount of rework that can
> be required and relatively simple error recovery.

Hm. Not sure I understand that problem yet. You mean it's a problem
that if you had M-R1-R2-R3 and R3 fails, we would have to restart at
R1?

That said, I am actually not that interested in MMMMR or MRRRR; it
just felt natural to think in that direction. The forking and joining
is more interesting. This can indeed be done today with the
MultipleInputs and MultipleOutputs classes, but even in the new API
that feels bolted on top.

Maybe let me explain the use case. I have one data source. In the
first mapper I would like to fan out, but I would like to emit
different data types:

    Mapper:
      if (a) emit(Text, Integer)
      if (b) emit(Long, Text)

and now I would like to have a Reducer for (a) and a separate Reducer
for (b). While reading the input once for each of (a) and (b) is
possible, it is too inefficient, especially if you have a..z, for
example.

Right now the best approach seems to be to use MultipleOutputs, write
an output for every forked data path a..b, emit nothing to the
"normal" context, and start new jobs based on the a..b outputs.

Now with the current API this could probably work by creating two
MultipleOutputs instances:

    outputA = new MultipleOutputs<Text, IntWritable>(context);
    outputB = new MultipleOutputs<LongWritable, Text>(context);

which probably should work - but it feels a little weird. Aren't
these really just SingleOutputs, where the context that gets passed
in to the map function just delegates to the default output?
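For illustration, here is roughly how I picture the fan-out with a
single MultipleOutputs instance and two named outputs. FanOutMapper
and the routing conditions are made up, and I have not run this:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Routes each record to named output "a" (Text, IntWritable) or
    // "b" (LongWritable, Text); nothing goes to the default context.
    public class FanOutMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private MultipleOutputs<NullWritable, NullWritable> out;

      @Override
      protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, NullWritable>(context);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (line.startsWith("a")) {          // stand-in for case (a)
          out.write("a", new Text(line), new IntWritable(line.length()));
        } else if (line.startsWith("b")) {   // stand-in for case (b)
          out.write("b", key, new Text(line));
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        out.close();  // flush all named outputs
      }
    }

The driver would register one named output per fork, each with its
own types, and run map-only so the forked outputs feed the follow-up
jobs:

    // SequenceFileOutputFormat here is
    // org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    MultipleOutputs.addNamedOutput(job, "a",
        SequenceFileOutputFormat.class, Text.class, IntWritable.class);
    MultipleOutputs.addNamedOutput(job, "b",
        SequenceFileOutputFormat.class, LongWritable.class, Text.class);
    job.setNumReduceTasks(0);

The named outputs carry their own types, so the job's default output
types degrade to NullWritable placeholders. Workable, but it shows
how bolted-on this feels.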
Now to the joining:

    outputA: Text, Integer
    outputB: Long, Text

Unless I am missing something, there is no easy way of joining these
two inputs. You would have to come up with a type that can
encapsulate both type combinations and do a switch statement in the
mapper. Again, just not so great from an API perspective IMO.
Wouldn't it be great if you could provide a map function per input
.... with even the right types in the signature?
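For completeness: MultipleInputs does get partway there, in that it
binds a distinct Mapper class to each input path. But both mappers
must agree on one common intermediate key/value type, so you end up
tagging and flattening the payloads anyway. A rough, untested sketch;
the class names and the "A:"/"B:" tagging are made up:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class JoinJob {

      // Input A: (Text, IntWritable), flattened to the common (Text, Text).
      public static class SideAMapper
          extends Mapper<Text, IntWritable, Text, Text> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
          context.write(key, new Text("A:" + value.get()));
        }
      }

      // Input B: (LongWritable, Text), flattened the same way.
      public static class SideBMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text(key.toString()), new Text("B:" + value));
        }
      }

      // Driver fragment: one mapper per input path.
      static void wire(Job job) {
        MultipleInputs.addInputPath(job, new Path("outputA"),
            SequenceFileInputFormat.class, SideAMapper.class);
        MultipleInputs.addInputPath(job, new Path("outputB"),
            SequenceFileInputFormat.class, SideBMapper.class);
      }
    }

The reducer then sees (Text, Text) and has to switch on the "A:"/"B:"
prefix, which is exactly the switch statement I would like the API to
spare me.

cheers
--
Torsten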