Hi, I am working on a research project optimizing Join algorithms implemented in MapReduce.
As far as I know, Pig currently implements three specialized types of join: Replicated Join, Skewed Join and Merge Join. From my reading of the documentation, it seems that Replicated Join and Merge Join are both map-side joins, while Skewed Join is a reduce-side join. Is that correct?

I have a few questions:

1. Does Replicated Join require the data sets to be sorted? (I know Merge Join requires sorted data sets.)

2. Can anyone point me to the actual implementation of the MapReduce program that Pig generates for these three kinds of joins, or to the code that maps a Pig join onto a Hadoop MapReduce join algorithm? I found POMergeJoin and POSkewedJoin, but I still couldn't figure out what the actual MapReduce implementation would look like.

Thanks,
Yunming
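
P.S. To make clear what I mean by a map-side join, here is a minimal Python sketch of how I picture a generic replicated (hash) join. This is only my assumption about the general technique, not Pig's actual code; the function name `replicated_join` and its parameters are made up for illustration. The small relation is loaded into an in-memory hash table and the large relation is streamed past it, as each mapper would do with its input split.

```python
from collections import defaultdict


def replicated_join(small, large, key_small=0, key_large=0):
    """Generic map-side hash join sketch (hypothetical, not Pig's implementation).

    small: iterable of tuples that is assumed to fit in memory
    large: iterable of tuples that is streamed, like a mapper's input split
    """
    # Build phase: index every row of the small relation by its join key.
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)

    # Probe phase: stream the large relation and emit one joined tuple
    # per matching row of the small relation. No reduce step is involved.
    for row in large:
        for match in table.get(row[key_large], []):
            yield row + match


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(2, "x"), (1, "y"), (3, "z"), (2, "w")]
    for joined in replicated_join(small, large):
        print(joined)
```

What I am trying to find out is whether the code Pig generates follows this pattern, and where in the source that happens.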
