Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by PradeepKamath: http://wiki.apache.org/pig/PigMergeJoin ------------------------------------------------------------------------------ This design will work for inner joins, and with slight modifications for left outer joins. It will not work for right outer or full outer joins. If we wish to extend it to work for those cases at some point in the future, it will have to be modified to also sample the left input. The reason for this is that in the current implementation !POMergeJoin does not know how far past the end of its input to keep accepting non-matching keys on the right side. It will need to know what key the next block of the left input starts on in order to determine when it should stop reading keys from the right input. A sampling pass on the left input that reads the first key of each block could provide this information. (Is the intent that each map task will at the end of its input continue reading keys from the right side till the first key in the next block and perform the outer join - for the outer join for the first key in the next block onwards the map task corresponding to that bloc k will handle the processing. The extra corner case is the for the first key on the left input the outer join for the all the right keys less than that key will need to be done by the map task processing the first key (the first key would be the first entry in the index for the left side) Perhaps a figure might help illustrate: - Left input Right input + Left input block 1: + || 25 || + || .. || + || 35 || - || 25 || || 10 || - || .. || || .. || - || 35 || || 24 || - || 25 || - || .. || - || 45 || || 35 || - || .. || || .. || - || 65 || || 44 || - || 45 || - || .. || + Left input block 2: + || 45 || + || .. || + || 65 || + + Right input + || 10 || + || .. || + || 24 || + || 25 || + || .. || + || 35 || + || .. || + || 44 || + || 45 || + || .. || + The first map would need to know it is the first map and hence handle the outer join for all the values on the right side < the first key (10 upto 24). It would then handle the join of all values present in the first block of the left input (25 to 35). It would also need to continue to read on the right side upto the first value in the next block of the left input (i.e. upto 44 inclusive). The next map (map on 2nd block) would handle the join of all values from 45 to EOF on right side. In current implementation (r806281) only inner joins are supported.
