Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by AshutoshChauhan:
http://wiki.apache.org/pig/PigMergeJoin

------------------------------------------------------------------------------
  
  == Multiway Join ==
  This algorithm could theoretically be extended to support joins of three or 
more inputs.  For now it will not be.  Pig will give an error if users give 
more than two
- inputs to a merge join.  If users wish to do joins of three or more inputs 
with this algorithm they can decompose them into a series of two-way joins.  
+ inputs to a merge join.  If users wish to do joins of three or more inputs 
with this algorithm they can decompose them into a series of two-way joins.
  
+ ----
+ == Phase 2 ==
+ 
+ Phase 1, which was committed in r804310, has a few limitations. They are 
listed below with possible solutions:
+ 
+ '''Predecessors''' :  Only filter and foreach are currently allowed as 
predecessors of Merge Join.
+ 
+ MRCompiler maintains state while compiling physical operators. Part of that 
state is the list of MR jobs created so far. Each of these MR jobs contains a 
pipeline of physical operators that have already been compiled. In the case of 
MergeJoin, at least two MR jobs will have been created by the time POMergeJoin 
is visited. POMergeJoin then needs to identify which of these MR jobs 
corresponds to its left input and which to its right. It does so by matching 
its predecessor physical operators in the physical plan against the physical 
operators in the compiled MR jobs. This is not reliable. Confusion arises 
especially when a preceding physical operator generates more than one MR job 
(e.g. order-by). To make Merge Join work in these scenarios we need a reliable 
way of knowing which physical operator belongs to which MR job. One proposal is 
to introduce a PhyOpToMROp map in the spirit of LogToPhyMap. More details at: 
https://issues.apache.org/jira/browse/PIG-858
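
The PhyOpToMROp idea can be sketched roughly as below. This is only an illustration, not the code proposed in PIG-858: the classes PhysicalOp and MROp and the methods record() and jobFor() are hypothetical stand-ins, not Pig's real classes. The sketch also assumes that an operator placed into several MR jobs should map to the last job it was placed in.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Pig's physical operators and MR jobs.
final class PhyOpToMROpSketch {
    static final class PhysicalOp {
        final String name;
        PhysicalOp(String name) { this.name = name; }
    }

    static final class MROp {
        final String name;
        final List<PhysicalOp> pipeline = new ArrayList<>();
        MROp(String name) { this.name = name; }
    }

    // Identity-based map: two distinct operators with the same name must not collide.
    private final Map<PhysicalOp, MROp> phyToMROp = new IdentityHashMap<>();

    // Called whenever the compiler places a physical operator into an MR job.
    // A later call for the same operator overwrites the earlier mapping (assumed policy).
    void record(PhysicalOp op, MROp job) {
        job.pipeline.add(op);
        phyToMROp.put(op, job);
    }

    // Direct lookup, instead of searching the pipeline of every compiled job.
    MROp jobFor(PhysicalOp op) {
        MROp job = phyToMROp.get(op);
        if (job == null) {
            throw new IllegalStateException("operator not compiled yet: " + op.name);
        }
        return job;
    }
}
```

With such a map, the join operator could look up the job of its left and right predecessor directly, even when both predecessors are operators of the same kind.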
+ 
+ '''Sort order''' : Data must be sorted in ascending order.
+ 
+ In POMergeJoin, keys should be compared by a comparator that can be set 
based on user input.
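
A minimal sketch of a merge join driven by a pluggable comparator follows. It is not POMergeJoin: it works on in-memory lists of integer keys, and the class name ComparatorMergeJoinSketch and the method join() are made up for illustration. The only point is that the merge logic never assumes ascending order; it relies solely on the comparator that both inputs were sorted with.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ComparatorMergeJoinSketch {
    // Joins two key lists that are both sorted according to keyOrder.
    // Emits one "leftKey=rightKey" string per matching pair.
    static List<String> join(List<Integer> left, List<Integer> right,
                             Comparator<Integer> keyOrder) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int c = keyOrder.compare(left.get(i), right.get(j));
            if (c < 0) {
                i++;                       // left key sorts first: advance left
            } else if (c > 0) {
                j++;                       // right key sorts first: advance right
            } else {
                // Equal keys: emit the cross product of the two runs.
                Integer key = left.get(i);
                int runStart = j;
                while (i < left.size() && keyOrder.compare(left.get(i), key) == 0) {
                    for (j = runStart;
                         j < right.size() && keyOrder.compare(right.get(j), key) == 0;
                         j++) {
                        out.add(left.get(i) + "=" + right.get(j));
                    }
                    i++;
                }
                // j now points just past the right-hand run of equal keys.
            }
        }
        return out;
    }
}
```

Passing Comparator.naturalOrder() gives the current ascending behaviour; passing Comparator.reverseOrder() lets the same code join inputs sorted in descending order.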
+ 
+ ''' End-of-All-Input ''' : POMergeJoin needs to know when it is being called 
for the last time. It does so by checking the end-of-all-input flag. The 
problem is that it assumes that when this flag is true the pipeline is running 
without any input and with status EOP. This holds in all cases except when one 
of the predecessors of the merge join is streaming. Streaming also makes use of 
the end-of-all-input flag and can potentially generate one or more tuples when 
this flag is set.
+ 
+ getNext() in POMergeJoin should be updated so that it doesn't make this 
assumption.
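
One way to picture the change is the toy model below. It is a simplified sketch, not Pig code: Status, Result, Upstream, BufferingUpstream and JoinLikeOperator are all hypothetical names, and unlike the real getNext() the operator here drains its input in a loop rather than returning one tuple per call. What it illustrates is the ordering: when the end-of-all-input flag is set, the operator first consumes any tuples its predecessor still produces, and only then does its last-call work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class EndOfAllInputSketch {
    enum Status { OK, EOP }   // hypothetical, loosely modelled on "tuple returned" vs "end of processing"

    static final class Result {
        final Status status;
        final String tuple;
        Result(Status status, String tuple) { this.status = status; this.tuple = tuple; }
    }

    interface Upstream {
        Result getNext(boolean endOfAllInput);
    }

    // Streaming-like predecessor: holds tuples back and releases them
    // only once the end-of-all-input flag is set.
    static final class BufferingUpstream implements Upstream {
        private final Deque<String> buffered = new ArrayDeque<>();
        void accept(String tuple) { buffered.add(tuple); }
        @Override
        public Result getNext(boolean endOfAllInput) {
            if (endOfAllInput && !buffered.isEmpty()) {
                return new Result(Status.OK, buffered.poll());
            }
            return new Result(Status.EOP, null);
        }
    }

    static final class JoinLikeOperator {
        private final Upstream input;
        final List<String> processed = new ArrayList<>();
        boolean finished = false;
        JoinLikeOperator(Upstream input) { this.input = input; }

        void getNext(boolean endOfAllInput) {
            // Do not assume EOP just because the flag is set: drain first.
            Result r = input.getNext(endOfAllInput);
            while (r.status == Status.OK) {
                processed.add(r.tuple);
                r = input.getNext(endOfAllInput);
            }
            if (endOfAllInput) {
                finished = true;   // last-call work happens only after draining
            }
        }
    }
}
```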
+ 
+      
+ 
