Can you pre-process the data to adhere to a uniform serialization scheme first?

Dir 1: <k, Writable(x)> to <k, x> to <k, Avro(x)>
Dir 2: <k, Avro(y)> to <k, Avro(y)>

or

Dir 1: <k, Writable(x)> to <k, Writable(x)>
Dir 2: <k, Avro(y)> to <k, y> to <k, Writable(y)>

Next, do a reduce-side join. To the best of my knowledge, Hadoop does not
allow multiple types for values on the reduce side.

On Tue, Jun 28, 2011 at 5:53 PM, W.P. McNeill <[email protected]> wrote:

> I have two directories. Directory 1 contains values of the form <k, x> and
> directory 2 contains values of the form <k, y>. The key values are the same
> in the two directories. I want to take them as input and produce output of
> the form <k, f(x,y)>. A reasonable strategy is to do a reduce-side join as
> described in section 3.5.1 of *Data-Intensive Text Processing with MapReduce*
> <http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421>.
>
> This works fine if x and y are of the same type (e.g. they're both Text). It
> also works if they are different types but both Writable (maybe x is Text
> and y is IntWritable), because you can still create a Writable object that
> wraps both of them and use that as the value type for both input
> directories.
>
> However, what if x is Writable and y is serialized with some other scheme,
> say Avro? It seems like you couldn't write a MapReduce process to
> generate <k, f(x,y)>, because the process can only specify a single
> serialization scheme for its value. Is there a way to write a MapReduce
> process to do a reduce-side join in this case?
>
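
To make the suggestion above concrete, here is a minimal sketch of the idea in plain Java. It deliberately uses no Hadoop or Avro classes: the class ReduceSideJoinSketch, the Tagged wrapper, and the use of String as the shared encoding are all hypothetical stand-ins. The assumption is that both directories have already been converted to one common value encoding, so each map output value only needs a tag saying which directory it came from. The join then groups by key and applies f once both sides are present.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical, Hadoop-free simulation of a reduce-side join over two inputs
// that were first normalized to a single value encoding (String here).
class ReduceSideJoinSketch {

    // Common value type: the payload plus a tag recording its source directory.
    static final class Tagged {
        final int source;      // 1 = directory 1, 2 = directory 2
        final String payload;  // value in the shared encoding

        Tagged(int source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    // "Map" phase: tag every record with its source and group it under its key,
    // which stands in for the shuffle.
    static void mapInput(int source, Map<String, String> records,
                         Map<String, List<Tagged>> shuffle) {
        for (Map.Entry<String, String> e : records.entrySet()) {
            shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(new Tagged(source, e.getValue()));
        }
    }

    // "Reduce" phase: for each key, pick out x and y by tag and emit f(x, y).
    // Keys that appear in only one directory are dropped (inner join).
    static Map<String, String> join(Map<String, String> dir1,
                                    Map<String, String> dir2,
                                    BiFunction<String, String, String> f) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        mapInput(1, dir1, shuffle);
        mapInput(2, dir2, shuffle);

        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            String x = null;
            String y = null;
            for (Tagged t : e.getValue()) {
                if (t.source == 1) {
                    x = t.payload;
                } else {
                    y = t.payload;
                }
            }
            if (x != null && y != null) {
                out.put(e.getKey(), f.apply(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> dir1 = new TreeMap<>();
        dir1.put("a", "apple");
        dir1.put("b", "banana");

        Map<String, String> dir2 = new TreeMap<>();
        dir2.put("a", "3");
        dir2.put("b", "5");

        System.out.println(join(dir1, dir2, (x, y) -> x + ":" + y));
    }
}
```

In a real job the Tagged class would presumably be a Writable or an Avro record, depending on which of the two conversions you pick, and the conversion itself would be a separate map-only pass over one of the directories.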
