Hi,

I have a problem concerning the TeraSort benchmark.
I am running the version that ships with hadoop-0.21.0, and if I use it as 
described (i.e. TeraGen - TeraSort - TeraValidate), everything works fine.

However, for some tests I need to run, I added a simple job between TeraGen and 
TeraSort that does nothing but copy the input. I included its code below. 

However, if I run this Copy job after TeraGen, TeraSort partitions the input in 
such a way that most tuples go to the last reducer. 
For example, if I run TeraSort on 500MB of input with 20 reducers, I get the 
following distribution:
- Reducers 0-18 process ~10,000 tuples each
- Reducer 19 processes ~5,000,000 tuples 
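For context on why I expected an even spread: as far as I understand, TeraSort samples keys from the input and picks split points at the sample's quantiles, so uniformly distributed keys should land evenly across reducers. A minimal standalone sketch of that idea (hypothetical names, not the actual TeraInputFormat sampler):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SamplePartitionSketch {

    // Counts how many of numKeys uniformly random keys land in each of
    // numReducers partitions when split points come from a sorted sample.
    static int[] partitionCounts(int numReducers, int numKeys) {
        Random rnd = new Random(42);

        // Stand-in for keys sampled from the job input.
        List<Integer> sample = new ArrayList<Integer>();
        for (int i = 0; i < 10000; i++) sample.add(rnd.nextInt(1000000));
        Collections.sort(sample);

        // One split point per reducer boundary, taken at sample quantiles.
        int[] splits = new int[numReducers - 1];
        for (int i = 0; i < splits.length; i++)
            splits[i] = sample.get((i + 1) * sample.size() / numReducers);

        // Assign fresh keys to partitions by comparing against split points.
        int[] counts = new int[numReducers];
        for (int i = 0; i < numKeys; i++) {
            int key = rnd.nextInt(1000000);
            int p = 0;
            while (p < splits.length && key >= splits[p]) p++;
            counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int c : partitionCounts(20, 100000))
            System.out.println(c);
    }
}
```

With uniform keys every partition ends up with roughly numKeys/numReducers tuples, which is why the observed pile-up on the last reducer looks like the sampling (or the key range it sees) goes wrong after the copy.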

Can anyone reproduce this behavior? I would really appreciate any help!

David


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.examples.terasort.TeraInputFormat;
import org.apache.hadoop.examples.terasort.TeraOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Copy extends Configured implements Tool {

    public int run(String[] args) throws IOException, InterruptedException,
            ClassNotFoundException {
        Job job = Job.getInstance(new Cluster(getConf()), getConf());

        // Read the TeraGen output with the same input format TeraSort uses.
        Path inputDirOld = new Path(args[0]);
        TeraInputFormat.addInputPath(job, inputDirOld);
        job.setInputFormatClass(TeraInputFormat.class);

        job.setJobName("Copy");
        job.setJarByClass(Copy.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // No mapper or reducer is set, so the defaults pass the records
        // through unchanged.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TeraOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Copy(), args);
        System.exit(res);
    }
}
