For the data joins, I let the framework do it - which means one
partition per split - so I have to choose my partition count carefully to
fill the machines.
I had an error in my initial outer join mapper. With that fixed, the join
map code now runs about 40x faster than the old brute-force, read-it-all
shuffle & sort.
Chris Douglas wrote:
Hi Jason-
> It seems like only full outer or full inner joins are supported. I
> was hoping to just do a left outer join.
> Is this supported or planned?
The full inner/outer joins are examples, really. You can define your
own operations by extending o.a.h.mapred.join.JoinRecordReader or
o.a.h.mapred.join.MultiFilterRecordReader and registering your new
identifier with the parser by defining a property
"mapred.join.define.<ident>" as your class.
For a left outer join, JoinRecordReader is the correct base.
InnerJoinRecordReader and OuterJoinRecordReader should make its use
clear.
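
To make the difference between the three join types concrete, here is a minimal, Hadoop-free sketch in plain Java. It is not the framework's code: the class name LeftOuterJoinSketch, the join helper, and the use of in-memory TreeMaps as stand-ins for sorted, identically partitioned inputs are all assumptions made for illustration. The idea it models is that each join type is just a different "combine" predicate over the tuple of per-source values for one key, and a left outer join accepts a tuple whenever the first (left) source contributed a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Hadoop.
class LeftOuterJoinSketch {

    // Inner join: emit only if every source has a value for the key.
    public static final Predicate<String[]> INNER = tuple -> {
        for (String v : tuple) {
            if (v == null) {
                return false;
            }
        }
        return true;
    };

    // Full outer join: emit every key seen in any source.
    public static final Predicate<String[]> OUTER = tuple -> true;

    // Left outer join: emit only if the leftmost source has a value.
    public static final Predicate<String[]> LEFT_OUTER = tuple -> tuple[0] != null;

    // Walks the union of keys in sorted order, builds one tuple per key
    // (null where a source lacks the key), and keeps the tuples the
    // combine predicate accepts.
    public static List<String> join(List<TreeMap<String, String>> sources,
                                    Predicate<String[]> combine) {
        TreeSet<String> keys = new TreeSet<>();
        for (TreeMap<String, String> source : sources) {
            keys.addAll(source.keySet());
        }
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            String[] tuple = new String[sources.size()];
            for (int i = 0; i < tuple.length; i++) {
                tuple[i] = sources.get(i).get(key);
            }
            if (combine.test(tuple)) {
                out.add(key + "=" + Arrays.toString(tuple));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> left = new TreeMap<>();
        left.put("a", "L1");
        left.put("b", "L2");
        TreeMap<String, String> right = new TreeMap<>();
        right.put("b", "R2");
        right.put("c", "R3");
        List<TreeMap<String, String>> sources = Arrays.asList(left, right);

        System.out.println("inner:      " + join(sources, INNER));
        System.out.println("outer:      " + join(sources, OUTER));
        System.out.println("left outer: " + join(sources, LEFT_OUTER));
    }
}
```

In the real framework, the equivalent of LEFT_OUTER would presumably live in a JoinRecordReader subclass, registered under "mapred.join.define.<ident>" as described above. The exact method to override is best checked against InnerJoinRecordReader and OuterJoinRecordReader in your Hadoop version.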
> On the flip side, doing the outer join is about 8x faster than doing a
> map/reduce over our dataset.
Cool! Out of curiosity, how are you managing your splits? -C