Could you explain the problem more clearly?

2012/5/5 Jim Donofrio <donofrio...@gmail.com>
> I am trying to use a map side join to merge the output of multiple map
> side joins. This is failing because of the below code in
> JobClient.writeOldSplits which reorders the splits from largest to
> smallest. Why is that done, is that so that the largest split which will
> take the longest gets processed first?
>
> Each map side join then fails to name its part-* files with the same
> number as the incoming partition so files named part-00000 that go
> into the first map side join get outputted to part-00010 while another one
> of the first level map side joins sends files named part-00000 to
> part-00005. The second level map side join then does not get the input
> splits in partitioner order from each first level map side join output
> directory.
>
> I can think of only 2 fixes, add some conf property to allow turning off
> the below sorting OR extend FileOutputCommitter to rename the outputs of
> the first level map side join to merge_part-<original partition number>.
> Any other solutions?
>
> // sort the splits into order based on size, so that the biggest
> // go first
> Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
>   public int compare(org.apache.hadoop.mapred.InputSplit a,
>                      org.apache.hadoop.mapred.InputSplit b) {
>     try {
>       long left = a.getLength();
>       long right = b.getLength();
>       if (left == right) {
>         return 0;
>       } else if (left < right) {
>         return 1;
>       } else {
>         return -1;
>       }

--
Regards
Junyong