I'm not sure whether this would work, or whether it's the right approach, but it may be worth looking into Hadoop Streaming; you might find something there.
Cheers,
James

On 2011-01-06, at 3:27 PM, W.P. McNeill wrote:

> Say I have two MapReduce processes, A and B. The two are algorithmically
> dissimilar, so they have to be implemented as separate MapReduce processes.
> The output of A is used as the input of B, so A has to run first. However,
> B doesn't need to take all of A's output as input, only a partition of it.
> So in theory A and B could run at the same time in a producer/consumer
> arrangement, where B would start to work as soon as A had produced some
> output but before A had completed. Obviously, this could be a big
> parallelization win.
>
> Is this possible in MapReduce? I know at the most basic level it is
> not -- there is no synchronization mechanism that allows the same HDFS
> directory to be used for both input and output -- but is there some abstraction
> layer on top that allows it? I've been digging around, and I think the
> answer is "No" but I want to be sure.
>
> More specifically, the only abstraction layer I'm aware of that chains
> together MapReduce processes is Cascade, and I think it requires the reduce
> steps to be serialized, but again I'm not sure because I've only read the
> documentation and haven't actually played with it.
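
For reference, the producer/consumer arrangement described in the quoted message can be sketched in plain Python. This is only an illustration of the idea, not Hadoop code: it uses threads and an in-memory queue rather than MapReduce jobs and HDFS, and the names stage_a, stage_b, run_pipeline and SENTINEL, along with the doubling and summing "work", are made up for the example.

```python
import queue
import threading

# Hypothetical end-of-stream marker: tells B that A has finished.
SENTINEL = None


def stage_a(records, partition_size, out_queue):
    """Producer: hand off each partition as soon as it is complete."""
    partition = []
    for record in records:
        partition.append(record * 2)  # stand-in for A's real work
        if len(partition) == partition_size:
            out_queue.put(partition)
            partition = []
    if partition:
        out_queue.put(partition)  # final, possibly short, partition
    out_queue.put(SENTINEL)


def stage_b(in_queue, results):
    """Consumer: process partitions without waiting for A to finish."""
    while True:
        partition = in_queue.get()
        if partition is SENTINEL:
            break
        results.append(sum(partition))  # stand-in for B's real work


def run_pipeline(records, partition_size=3):
    """Run A and B concurrently, connected by a thread-safe queue."""
    handoff = queue.Queue()
    results = []
    producer = threading.Thread(
        target=stage_a, args=(records, partition_size, handoff)
    )
    consumer = threading.Thread(target=stage_b, args=(handoff, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(7), partition_size=3))
```

The queue plays the role of the synchronization mechanism that the quoted message says plain HDFS directories lack: B blocks until a complete partition is available, and the sentinel tells it when A is done.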