Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by PradeepKamath:
http://wiki.apache.org/pig/PigMergeJoin

------------------------------------------------------------------------------
  This design will work for inner joins, and with slight modifications for left 
outer joins.  It will not work for right outer or full outer joins.  If we wish 
to extend it to work for those cases at some point in the future, it will have 
to be modified to also sample the left input.  The reason for this is that in 
the current implementation !POMergeJoin does not know how far past the end of 
its input to keep accepting non-matching keys on the right side.  It will need 
to know what key the next block of the left input starts on in order to 
determine when it should stop reading keys from the right input.  A sampling 
pass on the left input that reads the first key of each block could provide 
this information. (Is the intent that each map task will, at the end of its 
input, continue reading keys from the right side up to the first key in the 
next block and perform the outer join for them? From the first key in the next 
block onwards, the map task corresponding to that block would handle the 
processing. The extra corner case is the first key on the left input: the outer 
join for all the right keys less than that key would need to be done by the map 
task processing the first key, which would be the first entry in the index for 
the left side.)
  Perhaps a figure might help illustrate:
  
-  Left input   Right input
+ Left input block 1:
+ || 25 ||
+ || .. ||
+ || 35 ||
  
- || 25 ||      || 10 ||
- || .. ||      || .. ||
- || 35 ||      || 24 ||
-               || 25 || 
-               || .. ||
- || 45 ||      || 35 || 
- || .. ||      || .. ||
- || 65 ||      || 44 ||
-               || 45 ||
-               || .. ||
+ Left input block 2:
+ || 45 ||
+ || .. ||
+ || 65 ||
+ 
+ Right input
+ || 10 ||
+ || .. ||
+ || 24 ||
+ || 25 || 
+ || .. ||
+ || 35 || 
+ || .. ||
+ || 44 ||
+ || 45 ||
+ || .. ||
+ The first map would need to know that it is the first map, and hence handle 
the outer join for all the values on the right side that are less than the 
first left key (10 up to 24). It would then handle the join of all values 
present in the first block of the left input (25 to 35). It would also need to 
continue reading on the right side up to the first value in the next block of 
the left input (i.e. up to 44 inclusive). The next map (the map on the 2nd 
block) would handle the join of all values from 45 to EOF on the right side. 
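
The assignment of right-side keys to map tasks described above can be sketched 
as follows. This is only an illustration of the idea, not code from 
!POMergeJoin: the function names are hypothetical, the list of first keys per 
left block stands in for the index that a sampling pass on the left input 
might produce, and it assumes that keys are sorted and that a given key does 
not span two left blocks.

```python
import math
from bisect import bisect_right


def right_key_ranges(left_block_first_keys):
    """For each left block, return the half-open range [lo, hi) of right-side
    keys that the map task for that block would be responsible for.

    left_block_first_keys: sorted first key of every left block (hypothetical
    stand-in for a left-side index).
    """
    ranges = []
    n = len(left_block_first_keys)
    for i in range(n):
        # The first map also owns every right key below the first left key.
        lo = -math.inf if i == 0 else left_block_first_keys[i]
        # Each map reads on until the first key of the next left block;
        # the last map reads to EOF.
        hi = left_block_first_keys[i + 1] if i + 1 < n else math.inf
        ranges.append((lo, hi))
    return ranges


def owner_of(right_key, left_block_first_keys):
    """Index of the map task that would handle a given right-side key."""
    # Last block whose first key is <= right_key; keys below the first
    # left key fall to map 0.
    return max(bisect_right(left_block_first_keys, right_key) - 1, 0)


if __name__ == "__main__":
    first_keys = [25, 45]  # first keys of left blocks 1 and 2 in the figure
    print(right_key_ranges(first_keys))
    for key in (10, 24, 25, 44, 45):
        print(key, "-> map", owner_of(key, first_keys))
```

With the figure's data, right keys 10 through 44 fall to the first map and 
keys from 45 onwards fall to the second map, which matches the description 
above.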
   
  In the current implementation (r806281) only inner joins are supported.
  
