On Wed, 22 Jun 2011 15:16:02 -0700, Steve Lewis <lordjoe2...@gmail.com> wrote:
> Assume I have two data sources A and B
> Assume I have an input format and can generate key values for both A and B
> I want an algorithm which will generate the cross product of all values in A
> having the key K and all values in B having the key K.
> Currently I use a mapper to generate key values for A and have the reducer
> get all values in B with key K and hold them in memory.
> It works but might not scale.
>
> Any bright ideas?
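
For reference, here is roughly how I read the approach quoted above, as a minimal sketch in plain Java with no Hadoop dependency. The class and method names (KeyedCrossProduct, groupByKey, crossByKey) are made up for illustration, and the in-memory maps only stand in for what the shuffle and reducer would do. It shows where the memory pressure comes from: every B value for a key sits in a list while the A values are crossed against it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: simulates a per-key cross product outside of Hadoop.
final class KeyedCrossProduct {

    /** Groups {key, value} pairs by key, standing in for the shuffle phase. */
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    /** For each key present in both inputs, emits every (a, b) combination. */
    static List<String> crossByKey(List<String[]> a, List<String[]> b) {
        Map<String, List<String>> groupedA = groupByKey(a);
        Map<String, List<String>> groupedB = groupByKey(b);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : groupedA.entrySet()) {
            // All B values for this key are held in memory at once;
            // this buffer is the part that might not scale.
            List<String> bValues = groupedB.get(entry.getKey());
            if (bValues == null) {
                continue; // key appears only in A, so it contributes nothing
            }
            for (String av : entry.getValue()) {
                for (String bv : bValues) {
                    out.add(entry.getKey() + ":" + av + "," + bv);
                }
            }
        }
        return out;
    }
}
```

Keys that appear in only one of the two inputs produce no output here, which is what I assume is wanted for a per-key cross product.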

I was just thinking about a more general version of this problem.

If we think of a mapper as saying "for all <foo> in <set A> do
<something>", then what if we have two inputs?  That is, what if we want to
map over the cartesian product of two input sets?  "for all <(foo,bar)> in
<set A x set B> do <something>"

I'm thinking this should be a new InputFormat, which takes two
InputFormats (with configuration information) as parameters.  Then an
InputSplit for the product input is an input entry from A "times" an
InputSplit from B.  I haven't worked out the details, but this should be
the basic idea.
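
To make that a bit more concrete, here is a rough sketch in plain Java, again with no Hadoop classes. Everything in it is hypothetical: ProductSplits, SplitPair, productSplits and readRecords are names I made up, a "split" is just a list of records, and for simplicity it pairs whole splits from A with whole splits from B rather than single entries from A. The point is only that the set of product splits is itself a cross product, and each one can be read independently by one mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a "product" input: each composite split pairs
// one split of A with one split of B.
final class ProductSplits {

    /** Stand-in for a composite InputSplit made of one split from each input. */
    static final class SplitPair {
        final List<String> left;
        final List<String> right;

        SplitPair(List<String> left, List<String> right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Builds one composite split for every (split of A, split of B) pair. */
    static List<SplitPair> productSplits(List<List<String>> splitsA,
                                         List<List<String>> splitsB) {
        List<SplitPair> product = new ArrayList<>();
        for (List<String> sa : splitsA) {
            for (List<String> sb : splitsB) {
                product.add(new SplitPair(sa, sb));
            }
        }
        return product;
    }

    /** Stand-in for a record reader: yields every (a, b) record pair in one composite split. */
    static List<String> readRecords(SplitPair pair) {
        List<String> records = new ArrayList<>();
        for (String a : pair.left) {
            for (String b : pair.right) {
                records.add("(" + a + "," + b + ")");
            }
        }
        return records;
    }
}
```

With m splits in A and n splits in B this gives m * n composite splits, and reading all of them visits every element of A x B exactly once. A real version would still have to re-read each B split once per A split, which is one of the details I haven't worked out.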

Thoughts?
