Re: cannot use a map side join to merge the output of multiple map side joins

JunYong Li Sun, 06 May 2012 23:55:04 -0700

Could you explain the problem more clearly?

2012/5/5 Jim Donofrio <donofrio...@gmail.com>


> I am trying to use a map side join to merge the output of multiple map
> side joins. This fails because of the code below in
> JobClient.writeOldSplits, which reorders the splits from largest to
> smallest. Why is that done? Is it so that the largest split, which will
> take the longest, gets processed first?
>
> Each map side join then fails to name its part-* files with the same
> number as the incoming partition, so files named part-00000 that go
> into one first level map side join are output to part-00010, while
> another first level map side join sends files named part-00000 to
> part-00005. The second level map side join then does not get the input
> splits in partitioner order from each first level map side join output
> directory.
>
> I can think of only 2 fixes: add a conf property that turns off
> the sorting below, OR extend FileOutputCommitter to rename the outputs of
> the first level map side join to merge_part-<original partition number>.
> Any other solutions?
>
>    // sort the splits into order based on size, so that the biggest
>    // go first
>    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
>      public int compare(org.apache.hadoop.mapred.InputSplit a,
>                         org.apache.hadoop.mapred.InputSplit b) {
>        try {
>          long left = a.getLength();
>          long right = b.getLength();
>          if (left == right) {
>            return 0;
>          } else if (left < right) {
>            return 1;
>          } else {
>            return -1;
>          }
>
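
To make the reordering described above concrete, here is a minimal sketch that
does not use Hadoop at all. FakeSplit, LARGEST_FIRST and taskOrder are
hypothetical names made up for illustration; FakeSplit only stands in for
org.apache.hadoop.mapred.InputSplit, and the file sizes are invented. It
applies the same largest-first ordering as the quoted comparator and assumes,
as the original message describes, that each output file is numbered by the
position of its split after sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SplitOrderDemo {
    // Hypothetical stand-in for an input split: just a file name and a length.
    static final class FakeSplit {
        final String file;
        final long length;

        FakeSplit(String file, long length) {
            this.file = file;
            this.length = length;
        }
    }

    // Same ordering as the quoted comparator: the biggest split goes first.
    static final Comparator<FakeSplit> LARGEST_FIRST = new Comparator<FakeSplit>() {
        public int compare(FakeSplit a, FakeSplit b) {
            return Long.compare(b.length, a.length);
        }
    };

    // Returns the input file names in the order they end up in after sorting.
    static List<String> taskOrder(FakeSplit[] splits) {
        FakeSplit[] sorted = splits.clone();
        Arrays.sort(sorted, LARGEST_FIRST);
        List<String> names = new ArrayList<String>();
        for (FakeSplit s : sorted) {
            names.add(s.file);
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented sizes; the splits arrive in partition order.
        FakeSplit[] splits = {
            new FakeSplit("part-00000", 100L),
            new FakeSplit("part-00001", 300L),
            new FakeSplit("part-00002", 200L),
        };
        List<String> order = taskOrder(splits);
        for (int i = 0; i < order.size(); i++) {
            // Position i after sorting no longer matches the partition number.
            System.out.printf("task %05d reads %s%n", i, order.get(i));
        }
    }
}
```

With these made-up sizes, task 00000 reads part-00001, task 00001 reads
part-00002 and task 00002 reads part-00000, so an output file numbered by task
position would not carry the partition number of its input.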



-- 
Regards
Junyong
