Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by AshutoshChauhan:
http://wiki.apache.org/pig/PigMergeJoin

------------------------------------------------------------------------------
  == Multiway Join ==
  This algorithm could theoretically be extended to support joins of three or more inputs. For now it will not be. Pig will give an error if users give more than two
- inputs to a merge join. If users wish to do joins of three or more inputs with this algorithm, they can decompose them into a series of two-way joins.
+ inputs to a merge join. If users wish to do joins of three or more inputs with this algorithm, they can decompose them into a series of two-way joins.
+ ----
+ == Phase 2 ==
+ 
+ Phase 1, which was committed in r804310, has a few limitations. They are enumerated below with possible solutions:
+ 
+ '''Predecessors''' : Only filter and foreach are currently allowed as predecessors of Merge Join.
+ 
+ MRCompiler maintains state while compiling physical operators. Part of that state is the list of MR jobs that have already been created. These MR jobs contain the pipelines of physical operators that have already been compiled. In the case of MergeJoin, at least two MR jobs will have been created by the time POMergeJoin is visited. POMergeJoin then needs to identify which of these MR jobs corresponds to its left input and which to its right. It does so by matching its predecessor physical operators in the physical plan against the physical operators in the compiled MR jobs. This is not reliable. Confusion arises especially when a preceding physical operator has generated more than one MR job (e.g. in the case of order-by). To make Merge Join work in these scenarios, we need a reliable way of knowing which physical operator belongs to which MR job. A proposal to fix this is to introduce a PhyOpToMROp map in the spirit of LogToPhyMap. More details at: https://issues.apache.org/jira/browse/PIG-858
+ 
+ '''Sort order''' : Data must be sorted in ascending order.
+ 
+ In POMergeJoin, keys should be compared using a comparator that can be set based on user input.
+ 
+ ''' End-of-All-Input ''' : POMergeJoin needs to know when it is being called for the last time. It does so by checking the end-of-all-input flag. The problem is that it assumes that when this flag is true, the pipeline is running without any input and with status EOP. This holds in all cases except when one of the predecessors of the merge join is streaming. Streaming also makes use of the end-of-all-input flag and can potentially generate one or more tuples when this flag is set.
+ 
+ getNext() in POMergeJoin should be updated so that it doesn't make this assumption.
+ 
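The PhyOpToMROp idea from the '''Predecessors''' item can be pictured with a small sketch. This is not Pig code: it is written in Python for brevity, and the class name, method names, and job labels are all hypothetical. It assumes the compiler records, for every physical operator it compiles, the MR job that finally produces that operator's output, so that the join can look up its left and right inputs directly instead of matching operators against job pipelines.

```python
class PhyOpToMROpMap:
    """Hypothetical map from a physical operator to the MR job holding its output."""

    def __init__(self):
        self._job_for_op = {}

    def record(self, phy_op, mr_job):
        # Assumption: a later registration wins, so an operator that spans
        # several MR jobs ends up mapped to the last job it was compiled into.
        self._job_for_op[phy_op] = mr_job

    def job_for(self, phy_op):
        return self._job_for_op[phy_op]


def resolve_join_inputs(op_map, left_pred, right_pred):
    """Return the (left, right) MR jobs for a join's two predecessor operators."""
    return op_map.job_for(left_pred), op_map.job_for(right_pred)


# Illustrative compile order: the left predecessor is compiled into two jobs.
op_map = PhyOpToMROpMap()
op_map.record("order_by_left", "job_1")
op_map.record("order_by_left", "job_2")
op_map.record("filter_right", "job_3")

left_job, right_job = resolve_join_inputs(op_map, "order_by_left", "filter_right")
```

With such a map, a predecessor that generated more than one MR job no longer causes ambiguity, because the lookup never inspects the job pipelines.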
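The '''Sort order''' item and the advice on decomposing multiway joins can both be illustrated with a generic sort-merge join that takes its key comparator as a parameter. The following is a minimal sketch in Python, not the POMergeJoin implementation; the function names and the (key, value) tuple layout are assumptions made for illustration. Both inputs must already be sorted consistently with the comparator passed in.

```python
def ascending(a, b):
    """Three-way comparison: negative, zero, or positive."""
    return (a > b) - (a < b)


def descending(a, b):
    return ascending(b, a)


def merge_join(left, right, compare):
    """Join two lists of (key, value) pairs sorted consistently with `compare`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        c = compare(left[i][0], right[j][0])
        if c < 0:
            i += 1          # left key comes earlier in the order: advance left
        elif c > 0:
            j += 1          # right key comes earlier in the order: advance right
        else:
            key = left[i][0]
            # Find the run of tuples sharing this key on each side.
            i_end = i
            while i_end < len(left) and compare(left[i_end][0], key) == 0:
                i_end += 1
            j_end = j
            while j_end < len(right) and compare(right[j_end][0], key) == 0:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((key, (lv, rv)))
            i, j = i_end, j_end
    return out


a = [(1, "a1"), (2, "a2"), (2, "a2b"), (4, "a4")]
b = [(2, "b2"), (3, "b3"), (4, "b4")]
c = [(2, "c2"), (4, "c4"), (5, "c5")]

two_way = merge_join(a, b, ascending)
two_way_desc = merge_join(a[::-1], b[::-1], descending)
# A three-way join decomposed into two two-way joins: the first result is
# itself sorted by key, so it can feed the second join directly.
three_way = merge_join(two_way, c, ascending)
```

Because only `compare` decides which side to advance, supporting descending data in this sketch amounts to passing a different comparator.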
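The '''End-of-All-Input''' problem can be reproduced in a toy model. The sketch below is in Python and does not mirror Pig's actual getNext(); the classes, the status constants, and the flush-at-end behaviour of the streaming stand-in are assumptions. It shows one way an operator could avoid treating the flag alone as "no more tuples": it finalises only when the flag is set and the predecessor has really returned EOP.

```python
STATUS_OK = "OK"
STATUS_EOP = "EOP"


class StreamingPredecessor:
    """Hypothetical stand-in for streaming: emits buffered tuples only
    after the end-of-all-input flag has been set."""

    def __init__(self, tuples_flushed_at_end):
        self.end_of_all_input = False
        self._pending = list(tuples_flushed_at_end)

    def get_next(self):
        if self.end_of_all_input and self._pending:
            return STATUS_OK, self._pending.pop(0)
        return STATUS_EOP, None


class MergeJoinSketch:
    """Hypothetical join operator that does not equate the flag with EOP."""

    def __init__(self, predecessor):
        self.predecessor = predecessor
        self.end_of_all_input = False
        self.processed = []
        self.closed = False

    def get_next(self):
        status, tup = self.predecessor.get_next()
        if status == STATUS_OK:
            # A tuple may arrive even with the flag set: process it normally.
            self.processed.append(tup)
            return STATUS_OK, tup
        # Finalise only when the flag is set AND the predecessor returned EOP.
        if self.end_of_all_input:
            self.closed = True
        return STATUS_EOP, None


pred = StreamingPredecessor(["t1", "t2"])
join = MergeJoinSketch(pred)

before_flag = join.get_next()           # flag not set yet: EOP, nothing closed
pred.end_of_all_input = True
join.end_of_all_input = True
results = [join.get_next() for _ in range(3)]
```

In this model the two late tuples are joined before the operator closes, which is the behaviour the streaming case requires.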
