On a recent project, I found what I believe to be a shortcoming in the current Mapside Join[1] implementation. It appears the current implementation loads all values that share a common key, from all input datasets, into memory, and then calls the Mapper multiple times with the Cartesian product of those values. The problem is that when any of the input datasets has a very large number of values for the same key, trying to hold them all at once can exhaust memory.
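For concreteness, here is roughly how a job drives the current implementation, as I understand it (the paths and value types are illustrative, not from my actual project). Note that the mapper sees one fully-materialized combination per call:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class CrossProductJoin {

  // The mapper is invoked once per element of the cross product: if a key
  // has m values in the left dataset and n in the right, map() runs m * n
  // times, and (as I read the code) all m + n values are held in memory.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Text left = (Text) value.get(0);   // value from the first dataset
      Text right = (Text) value.get(1);  // value from the second dataset
      out.collect(key, new Text(left + "\t" + right));
    }
  }

  public static void configure(JobConf conf) {
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" join over two sequence-file datasets; paths are made up.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/left"), new Path("/data/right")));
    conf.setMapperClass(JoinMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }
}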
I have written an alternative implementation of Mapside Join that does not have this problem. Rather than loading all values with a shared key into memory at once, this implementation calls the Mapper with a Map of Iterators over the values (similar to how values are passed to a Reducer). My employer has graciously agreed to let me contribute this implementation back to the open source community, but I'm not sure of the best way to do so. According to this link[2], my first step is to send a message to this list describing the proposed change. Here is my description:

- I propose an alternative implementation of Mapside Join that passes values with a common key to the Mapper as a Map of Iterators, rather than calling the Mapper with the Cartesian product of the values. (A rough sketch of the mapper contract I have in mind is at the bottom of this message.)
- This implementation is intended as an alternative to, rather than a replacement for, the current Mapside Join implementation. The current implementation has features (for example, the ability to perform inner joins) that my proposed implementation does not; mine only supports outer join.

I am using Cloudera's CDH3 distribution (which is based on 0.20.2), and I wrote my implementation against the "old" API (org.apache.hadoop.mapred). If I'm not mistaken, at least in 0.20.2, the current Mapside Join implementation is not available in the new API. That's why I started with the old API, and when I realized the current implementation wouldn't work for my situation, I continued using the old API for my alternative.

I realize that targeting the old API and a fairly old version of Hadoop may hinder acceptance. I would be willing to port the implementation to a newer API/release, but to be honest, the current Hadoop release numbering baffles me, and I have no idea which release I would target.

So, my questions are:

- Is there any interest in this contribution?
- If so, do I need to port it to a different API/release?
- I assume I would contribute this via a Jira. Correct?
- Is there anything else I should consider that I may be overlooking?

Thanks,
Stuart

[1] https://issues.apache.org/jira/browse/HADOOP-2085
[2] http://wiki.apache.org/hadoop/HowToContribute
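P.S. Here is the rough sketch of the mapper contract I have in mind. The names are illustrative, not final, and this is not part of the existing org.apache.hadoop.mapred.join API:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical interface name; the point is the shape of the contract.
public interface IteratorJoinMapper<K extends WritableComparable,
    V extends Writable, K2, V2> {

  // Called once per join key. The map's keys identify the source datasets;
  // each Iterator streams that dataset's values for the current key, so no
  // value set is ever fully materialized in memory. A source with no values
  // for the key maps to an empty iterator (outer-join semantics).
  void join(K key, Map<String, Iterator<V>> valuesBySource,
      OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}

The intent is that this parallels how a Reducer receives its values, with each source's values consumable in a single streaming pass.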
