cannot use a map side join to merge the output of multiple map side joins

Jim Donofrio Sat, 05 May 2012 08:50:48 -0700

I am trying to use a map side join to merge the output of multiple map side joins. This is failing because of the code below, from JobClient.writeOldSplits, which reorders the splits from largest to smallest. Why is that done? Is it so that the largest split, which will take the longest, gets processed first?

Each map side join then fails to name its part-* files with the same number as the incoming partition. Files named part-00000 that go into one first level map side join are output to part-00010, while another first level map side join sends its files named part-00000 to part-00005. The second level map side join then does not get the input splits in partitioner order from each first level map side join output directory.
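
To make the renumbering concrete, here is a minimal sketch with no Hadoop dependency. SplitOrderDemo and FakeSplit are hypothetical names made up for this illustration, and the sketch assumes that task N writes part-N according to the split's position after the largest-first sort, which is the behavior described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustration only: SplitOrderDemo and FakeSplit are hypothetical, not Hadoop classes. */
final class SplitOrderDemo {

    /** Stand-in for an input split: the file it covers and its length in bytes. */
    static final class FakeSplit {
        final String file;
        final long length;

        FakeSplit(String file, long length) {
            this.file = file;
            this.length = length;
        }
    }

    /** Returns the input file names in the order they fall after a largest-first sort. */
    static List<String> taskOrder(List<FakeSplit> splits) {
        List<FakeSplit> sorted = new ArrayList<>(splits);
        // Same ordering as the size-based sort in writeOldSplits: biggest first.
        sorted.sort(Comparator.comparingLong((FakeSplit s) -> s.length).reversed());
        List<String> names = new ArrayList<>();
        for (FakeSplit s : sorted) {
            names.add(s.file);
        }
        return names;
    }

    public static void main(String[] args) {
        List<FakeSplit> splits = new ArrayList<>();
        splits.add(new FakeSplit("part-00000", 100L));
        splits.add(new FakeSplit("part-00001", 300L));
        splits.add(new FakeSplit("part-00002", 200L));
        List<String> order = taskOrder(splits);
        for (int task = 0; task < order.size(); task++) {
            // Assumption: the task index after sorting decides the output part number.
            System.out.printf("input %s -> output part-%05d%n", order.get(task), task);
        }
    }
}
```

With these made-up sizes, the largest input, part-00001, becomes the first task, so its output no longer carries the partition number it came in with.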

I can think of only two fixes: add a conf property that allows turning off the sorting below, OR extend FileOutputCommitter to rename the outputs of the first level map side join to merge_part- followed by the original partition number. Any other solutions?


    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            // a is smaller, so it sorts after b (descending by length)
            return 1;
          } else {
            return -1;
          }
        } catch (IOException ie) {
          throw new RuntimeException("Problem getting input split size", ie);
        }
      }
    });
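
For the second fix, the naming part could look roughly like the sketch below. MergePartNames is a hypothetical helper, not a Hadoop class, and it only covers deriving the new name. It assumes the task can find out the path of the part file it is reading. Wiring this into FileOutputCommitter or an OutputFormat is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: MergePartNames is a hypothetical helper, not part of Hadoop. */
final class MergePartNames {

    // Matches a trailing name such as part-00000, with or without a directory prefix.
    private static final Pattern PART = Pattern.compile("(?:^|/)part-(\\d+)$");

    /** Extracts the partition number from an input path such as /in/a/part-00007. */
    static int partitionOf(String inputPath) {
        Matcher m = PART.matcher(inputPath);
        if (!m.find()) {
            throw new IllegalArgumentException("not a part file: " + inputPath);
        }
        return Integer.parseInt(m.group(1));
    }

    /** Builds an output name that keeps the original partition number. */
    static String mergedName(String inputPath) {
        return String.format("merge_part-%05d", partitionOf(inputPath));
    }

    public static void main(String[] args) {
        // The path is made up for the example.
        System.out.println(mergedName("/joins/first/part-00000"));
    }
}
```

Because the name depends only on the incoming partition and not on the task number, it should stay stable regardless of how the splits are sorted.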
