Re: MapSide Join and left outer or right outer joins?

Jason Venner Thu, 03 Jul 2008 13:28:58 -0700

Ahh yes, it is absolutely critical that the partitioning in all of thesets be the same :)

We are currently assuming that:
1) Default Partitioner
2) Same Reduce Count
3) Text keys
will guarantee that, but as I mentioned above, we are assuming ;)


Testing is just starting

Chris Douglas wrote:

Sorry, I meant splits (partitions of input data). If you have ndatasets and m splits per dataset, m_i must contain the same keys forall n. So if you're joining two datasets A and B sharing a key k, ifsplit i from A contains any instances of k, (a) split i from A mustcontain all instances of k from A and (b) split i from B must containall instances of k from B. -C
On Jul 3, 2008, at 12:06 PM, Jason Venner wrote:
We are using the default partitioner. I am just about to startverifying my result as it took quite a while to work my way throughthe in-obvious issues of hand writing MapFiles, thinks like the keyand value class are extracted from the jobconf, output key/value.
Question: I looked at the HashPartitioner (which we are using) and akey's partition is simply based on the key.hashCode() %conf.getNumReduces().How will I get equal keys going to different partitions - clearly Ihave an understanding gap.
Thanks!


Chris Douglas wrote:
Forgive me if you already know this, but the correctness of themap-side join is very sensitive to partitioning; if your input insorted but equal keys go to different partitions, your results maybe incorrect. Is your input such that the default partitioning issufficient? Have you verified the correctness of your results? -C
On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
For the data joins, I let the framework do it - which means onepartition per split - so I have to chose my partition countcarefully to fill the machines.
I had an error in my initial outer join mapper, the join map codenow runs about 40x faster than the old brute force read it allshuffle & sort.
Chris Douglas wrote:
Hi Jason-
It only seems like full outer or full inner joins are supported.I was hoping to just do a left outer join.
Is this supported or planned?
The full inner/outer joins are examples, really. You can defineyour own operations by extendingo.a.h.mapred.join.JoinRecordReader oro.a.h.mapred.join.MultiFilterRecordReader and registering your newidentifier with the parser by defining a property"mapred.join.define.<ident>" as your class.
For a left outer join, JoinRecordReader is the correct base.InnerJoinRecordReader and OuterJoinRecordReader should make itsuse clear.
On the flip side doing the Outer Join is about 8x faster thandoing a map/reduce over our dataset.
Cool! Out of curiosity, how are you managing your splits? -C
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact ifinterested

--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>

Attributor is hiring Hadoop Wranglers and coding wizards, contact ifinterested

Re: MapSide Join and left outer or right outer joins?

Reply via email to