Dear users, hope this is the right list to submit this one, otherwise I apologize.
I'd like to have your opinion about a problem that I'm facing with the MapReduce framework. I am writing my code in Java and running it on a grid.

I have a textual input structured in <key, value> pairs. My task is to compute the cartesian product of all the values that have the same key. I can do this simply by using <key> as my map key, so that every value with the same key is sent to the same reducer, where I can easily process them and obtain the cartesian product.

However, my keys are not uniformly distributed; the distribution is very broad. As a result, the output of my reducers will be very unbalanced: I will have many small files (a few KB) and a handful of huge files (tens of GB). A sub-optimal yet acceptable approximation of my task would be to compute the cartesian product over smaller chunks of values for very frequent keys, so that the load is distributed evenly among reducers.

I am wondering how I can do this in the most efficient/elegant way. It appears to me that using a customized Partitioner is not the right approach, since records with the same key still have to be mapped together (am I right?). The only solution that comes to my mind is to split the key space artificially inside the mapper (e.g., for a frequent key "ABC", map the values to the reducers using keys like "ABC1", "ABC2", and so on). This would require an additional post-processing cleanup phase to retrieve the original keys.

Do you know a better, possibly automatic way to perform this task?

Thank you!
Best
Luca
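
P.S. To make the key-splitting idea concrete, here is a rough sketch of what I have in mind, in plain Java with no Hadoop dependencies. The class name KeySalter, the '#' separator, and the "buckets" parameter are only placeholders of mine. How many buckets a given key should get (e.g., from a prior counting pass) is assumed to be known and is not shown here.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: spreads one logical key over several sub-keys. */
final class KeySalter {
    // Assumed separator; a suffix is always appended, so it may also occur inside real keys.
    private static final char SEP = '#';

    /** Mapper side: returns e.g. "ABC#2" for key "ABC" with a random bucket in [0, buckets). */
    static String salt(String key, int buckets) {
        if (buckets < 1) {
            throw new IllegalArgumentException("buckets must be >= 1");
        }
        // Rare keys use buckets == 1 and therefore always get the suffix "#0".
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return key + SEP + bucket;
    }

    /** Cleanup side: strips the last "#n" suffix to recover the original key. */
    static String unsalt(String saltedKey) {
        int i = saltedKey.lastIndexOf(SEP);
        return i < 0 ? saltedKey : saltedKey.substring(0, i);
    }
}
```

With this, the mapper would emit salt(key, buckets) instead of key, each reducer would compute the cartesian product only within its own chunk, and the cleanup step would call unsalt on the output keys. Pairs of values that land in different buckets are never combined, which is the approximation I described above.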
