On a recent project, I found what I believe to be a shortcoming in the current Mapside Join[1] implementation. It appears the current implementation loads all values that share a common key, from all input datasets, into memory, and then calls the Mapper multiple times with the Cartesian product of those values. The problem is that when any of the input datasets has a very large number of values for the same key, trying to hold them all at once can exhaust memory.
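For concreteness, here is roughly how a job drives the current implementation, as I understand it (the paths and value types are illustrative, not from my actual project). Note that the mapper sees one fully-materialized combination per call:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class CrossProductJoin {

  // The mapper is invoked once per element of the cross product: if a key
  // has m values in the left dataset and n in the right, map() runs m * n
  // times, and (as I read the code) all m + n values are held in memory.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Text left = (Text) value.get(0);   // value from the first dataset
      Text right = (Text) value.get(1);  // value from the second dataset
      out.collect(key, new Text(left + "\t" + right));
    }
  }

  public static void configure(JobConf conf) {
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" join over two sequence-file datasets; paths are made up.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/left"), new Path("/data/right")));
    conf.setMapperClass(JoinMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }
}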
I have written an alternative implementation of Mapside Join that does not have this problem. Rather than loading all values with a shared key into memory at once, this implementation calls the Mapper with a Map of Iterators over the values (similar to how values are passed to a Reducer). My employer has graciously agreed to let me contribute this implementation back to the open source community, but I'm not sure of the best way to do so. According to this link[2], my first step is to send a message to this list describing the proposed change. Here is my description:

- I propose an alternative implementation of Mapside Join that passes values with a common key to the Mapper as a Map of Iterators, rather than calling the Mapper with the Cartesian product of the values. (A rough sketch of the mapper contract I have in mind is at the bottom of this message.)
- This implementation is intended as an alternative to, rather than a replacement for, the current Mapside Join implementation. The current implementation has features (for example, the ability to perform inner joins) that my proposed implementation does not; mine only supports outer join.

I am using Cloudera's CDH3 distribution (which is based on 0.20.2), and I wrote my implementation against the "old" API (org.apache.hadoop.mapred). If I'm not mistaken, at least in 0.20.2, the current Mapside Join implementation is not available in the new API. That's why I started with the old API, and when I realized the current implementation wouldn't work for my situation, I continued using the old API for my alternative.

I realize that targeting the old API and a fairly old version of Hadoop may hinder acceptance. I would be willing to port the implementation to a newer API/release, but to be honest, the current Hadoop release numbering baffles me, and I have no idea which release I would target.

So, my questions are:

- Is there any interest in this contribution?
- If so, do I need to port it to a different API/release?
- I assume I would contribute this via a Jira. Correct?
- Is there anything else I should consider that I may be overlooking?

Thanks,
Stuart

[1] https://issues.apache.org/jira/browse/HADOOP-2085
[2] http://wiki.apache.org/hadoop/HowToContribute
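P.S. Here is the rough sketch of the mapper contract I have in mind. The names are illustrative, not final, and this is not part of the existing org.apache.hadoop.mapred.join API:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical interface name; the point is the shape of the contract.
public interface IteratorJoinMapper<K extends WritableComparable,
    V extends Writable, K2, V2> {

  // Called once per join key. The map's keys identify the source datasets;
  // each Iterator streams that dataset's values for the current key, so no
  // value set is ever fully materialized in memory. A source with no values
  // for the key maps to an empty iterator (outer-join semantics).
  void join(K key, Map<String, Iterator<V>> valuesBySource,
      OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}

The intent is that this parallels how a Reducer receives its values, with each source's values consumable in a single streaming pass.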
