Dear users,

I hope this is the right list for this question; if not, I apologize.

I'd like to have your opinion about a problem that I'm facing on MapReduce 
framework. I am writing my code in Java and running on a grid.

I have textual input structured as <key, value> pairs. My task is to compute the
Cartesian product of all the values that share the same key.
I can do this simply by using <key> as my map output key, so that every value
with the same key is sent to the same reducer, where I can easily process the
values and compute the Cartesian product.
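Stripped of the Hadoop boilerplate, the reduce-side step I have in mind is just this (class and method names are mine, only for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class CartesianSketch {
    // Cartesian product of a key's values with themselves, as a single
    // reducer would see them after the shuffle: every ordered pair (a, b).
    static List<String[]> cartesian(List<String> values) {
        List<String[]> out = new ArrayList<>();
        for (String a : values) {
            for (String b : values) {
                out.add(new String[] {a, b});
            }
        }
        return out;
    }
}
```

For n values this emits n^2 pairs, which is exactly why a single hot key blows up one reducer's output.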

However, my keys are not uniformly distributed; the distribution is heavily
skewed. As a result, the output of my reducers will be very unbalanced: I will
have many small files (a few KB) and a handful of huge ones (tens of GB). A
sub-optimal yet acceptable approximation of my task would be to compute the
Cartesian product over smaller chunks of values for very frequent keys, so that
the load is distributed evenly among the reducers. I am wondering how I can do
this in the most efficient and elegant way.

It seems to me that a custom Partitioner is not the right tool here, since
records with the same key still have to end up together (am I right?).
The only solution that comes to mind is to split the key space artificially
inside the mapper (e.g., for a frequent key "ABC", route the values to the
reducers under keys like "ABC1", "ABC2", and so on). This would require an
additional post-processing cleanup phase to recover the original keys.
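A minimal sketch of that key-salting idea, assuming a per-mapper counter (the chunk size and all names are my own invention, not any library API):

```java
import java.util.HashMap;
import java.util.Map;

public class KeySalter {
    // Max values routed to any single salted key (a tuning knob, my assumption).
    static final int CHUNK = 1000;
    private final Map<String, Integer> counts = new HashMap<>();

    // Returns "ABC1", "ABC2", ... in arrival order, so each salted key
    // receives at most CHUNK values from this mapper.
    String salt(String key) {
        int n = counts.merge(key, 1, Integer::sum);
        int bucket = (n - 1) / CHUNK + 1;
        return key + bucket;
    }
}
```

Note the two caveats this sketch inherits from the approach itself: the counter is local to one mapper, so the cap is per-mapper rather than global, and pairs spanning two buckets are never produced, which is exactly the approximation described above.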

Do you know a better, possibly automatic way to perform this task?

Thank you!
Best

Luca
