The following page has been changed by PradeepKamath:
http://wiki.apache.org/pig/PigMergeJoin

------------------------------------------------------------------------------
  == Preconditions for merge join ==
  In the first release, merge join will only work under the following conditions:

   * Both inputs are sorted in *ascending* order of join keys. If an input consists of many files, there should be a total ordering across the files in ascending order of filename. For example, if one of the inputs to the join is a directory called input1 with files a and b under it, the data should be sorted in ascending order of join key when read starting at a and ending at b. Likewise, if an input directory has part files part-00000, part-00001, part-00002 and part-00003, the data should be sorted when the files are read in the sequence part-00000, part-00001, part-00002, part-00003.
+  * Each part file of the sorted input should be at least one HDFS block in size (for example, if the HDFS block size is 128 MB, each part file should be > 128 MB). If the total input size (including all part files) is less than one block, the part files should be roughly uniform in size (without large skews in sizes).
   * The merge join has exactly two inputs.
   * The loadfunc for the right input of the join should implement the SamplableLoader interface (PigStorage does implement the SamplableLoader interface).
   * Only inner join is supported.
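
The ordering and size conditions above can be checked before attempting a merge join. The following is only a minimal Python sketch, not part of Pig: the helper names (`keys_in_file_order`, `is_totally_ordered`, `undersized_parts`) are hypothetical, and it assumes the join keys and file sizes have already been read into plain dictionaries keyed by part-file name.

```python
from typing import Dict, List, Sequence


def keys_in_file_order(parts: Dict[str, Sequence[int]]) -> List[int]:
    """Concatenate join keys, reading part files in ascending filename order."""
    ordered: List[int] = []
    for name in sorted(parts):  # e.g. part-00000, part-00001, ...
        ordered.extend(parts[name])
    return ordered


def is_totally_ordered(parts: Dict[str, Sequence[int]]) -> bool:
    """True if the keys are ascending when files are read in filename order."""
    keys = keys_in_file_order(parts)
    return all(a <= b for a, b in zip(keys, keys[1:]))


def undersized_parts(sizes: Dict[str, int], block_size: int) -> List[str]:
    """Names of part files smaller than one block.

    Assumption: "at least one block" is read as size >= block_size.
    """
    return sorted(name for name, size in sizes.items() if size < block_size)


if __name__ == "__main__":
    good = {"part-00000": [1, 2, 2], "part-00001": [2, 5]}
    bad = {"a": [5, 6], "b": [1, 2]}  # sorted within each file, not across files
    print(is_totally_ordered(good), is_totally_ordered(bad))
    print(undersized_parts({"part-00000": 200, "part-00001": 100}, 128))
```

Note that the second input in the usage example is sorted within each file but fails the check, because the total ordering must hold across files in filename order.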
