On Wed, 22 Jun 2011 15:16:02 -0700, Steve Lewis <lordjoe2...@gmail.com> wrote:
> Assume I have two data sources A and B
> Assume I have an input format and can generate key values for both A and B
> I want an algorithm which will generate the cross product of all values in A
> having the key K and all values in B having the key K.
> Currently I use a mapper to generate key values for A and have the reducer
> get all values in B with key K and hold them in memory.
> It works but might not scale.
>
> Any bright ideas?
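
As a reference point, the approach described in the quoted message can be sketched in plain Python. This is not Hadoop code: the grouping step only stands in for the shuffle, and the names `cross_for_key` and `keyed_cross_product` are made up for illustration. It assumes both sources arrive as (key, value) pairs.

```python
from collections import defaultdict


def cross_for_key(key, a_values, b_values):
    """Reducer-style step: pair every A value with every B value for one key."""
    # The B values for this key are held in memory, which is the part
    # that might not scale when a single key has many B values.
    buffered_b = list(b_values)
    for a in a_values:
        for b in buffered_b:
            yield (key, a, b)


def keyed_cross_product(a_records, b_records):
    """Group (key, value) records by key, then cross the two groups per key."""
    a_by_key = defaultdict(list)
    b_by_key = defaultdict(list)
    for key, value in a_records:
        a_by_key[key].append(value)
    for key, value in b_records:
        b_by_key[key].append(value)
    # Only keys present in both sources produce output.
    for key in sorted(a_by_key.keys() & b_by_key.keys()):
        yield from cross_for_key(key, a_by_key[key], b_by_key[key])
```

Memory use per key grows with the number of B values for that key, so a heavily skewed key is where this would likely hurt first.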
I was just thinking about a more general version of this problem. If we think of a mapper as saying "for all <foo> in <set A> do <something>", then what if we have two inputs? That is, what if we want to map over the Cartesian product of two input sets?

"for all <(foo,bar)> in <set A x set B> do <something>"

I'm thinking this should be a new InputFormat, which takes two InputFormats (with configuration information) as parameters. Then an InputSplit for the product input is an input entry from A "times" an InputSplit from B. I haven't worked out the details, but this should be the basic idea.

Thoughts?
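
To make the shape of the idea concrete, here is a rough sketch in plain Python rather than the Hadoop API. Lists stand in for input entries and InputSplits, and `product_splits` and `read_product_split` are hypothetical names, not existing Hadoop methods. Each product split pairs one entry from A with one split of B; pairing whole splits of A with splits of B would presumably follow the same pattern.

```python
from itertools import product


def product_splits(entries_a, splits_b):
    """Build one product split per (entry from A, split of B) combination."""
    return list(product(entries_a, splits_b))


def read_product_split(split):
    """Record-reader-style step: yield (foo, bar) for a single product split."""
    entry_a, split_b = split
    for entry_b in split_b:
        yield (entry_a, entry_b)


def map_over_product(entries_a, splits_b, mapper):
    """Apply mapper to every (foo, bar) in A x B, one product split at a time."""
    for split in product_splits(entries_a, splits_b):
        for pair in read_product_split(split):
            yield mapper(pair)
```

The number of product splits here is the number of entries in A times the number of splits in B, so a real implementation would likely need to batch entries from A to keep the split count manageable.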