Fwd: TeraSort bug?

David Saile Mon, 28 Feb 2011 02:30:47 -0800

Sorry list, please ignore the previous mail!

I really have to apologize for this!



Begin forwarded message:

> From: David Saile <[email protected]>
> Date: 28 February 2011 11:28:20 CET
> To: David Saile <[email protected]>
> Subject: Re: TeraSort bug?
> 
> Hello Ralf,
> 
> Unfortunately, I have not yet received a reply from the mailing list. Would 
> you have time today for a short status meeting?
> 
> I will be at the university from about 14:15. We could also meet later, at 
> 16:00, if that suits you better.
> 
> Regards,
> David
> 
>   
> On 27.02.2011 at 23:33, David Saile wrote:
> 
>> Hello Ralf,
>> 
>> The cluster is running reasonably stably, but I have spent the last 2-3 
>> days struggling with the problem that TeraSortDelta assigns almost all 
>> tuples to the same reducer (I explained this briefly on Thursday).
>> However, I think the problem lies in TeraSort or Hadoop itself, since I was 
>> able to reproduce it with a simple copy job as well.
>> 
>> I have just sent the email attached below to the Hadoop mailing list. 
>> I hope to get a reply by tomorrow.
>> 
>> How should I proceed until then? Should I try to implement some kind of 
>> pipeline job?
>> 
>> Best regards,
>> David
>> 
>> 
>> Begin forwarded message:
>> 
>>> From: David Saile <[email protected]>
>>> Date: 27 February 2011 23:27:16 CET
>>> To: [email protected]
>>> Subject: TeraSort bug?
>>> Reply-To: [email protected]
>>> 
>>> Hi,
>>> 
>>> I have a problem concerning the TeraSort benchmark.
>>> I am running the version that ships with hadoop-0.21.0, and if I use it as 
>>> described (i.e. TeraGen - TeraSort - TeraValidate), everything works fine.
>>> 
>>> However, for some tests I need to run, I added a simple job between TeraGen 
>>> and TeraSort that does nothing but copy the input. I included its code 
>>> below. 
>>> 
>>> If I run this Copy job after TeraGen, TeraSort partitions the input in 
>>> such a way that most tuples go to the last reducer. 
>>> For example, if I run TeraSort with 500 MB of input and 20 reducers, I get 
>>> the following distribution:
>>> - Reducers 0-18 process ~10,000 tuples each
>>> - Reducer 19 processes ~5,000,000 tuples 
>>> 
>>> Can anyone reproduce this behavior? I would really appreciate any help!
>>> 
>>> David
>>> 
>>> 
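>>> import java.io.IOException;
>>> 
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.conf.Configured;
>>> import org.apache.hadoop.examples.terasort.TeraInputFormat;
>>> import org.apache.hadoop.examples.terasort.TeraOutputFormat;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Cluster;
>>> import org.apache.hadoop.mapreduce.Job;
>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>> import org.apache.hadoop.util.Tool;
>>> import org.apache.hadoop.util.ToolRunner;
>>> 
>>> public class Copy extends Configured implements Tool {
>>> 
>>>     public int run(String[] args)
>>>             throws IOException, InterruptedException, ClassNotFoundException {
>>>         Job job = Job.getInstance(new Cluster(getConf()), getConf());
>>> 
>>>         // Read the TeraGen output from args[0].
>>>         Path inputDirOld = new Path(args[0]);
>>>         TeraInputFormat.addInputPath(job, inputDirOld);
>>>         job.setInputFormatClass(TeraInputFormat.class);
>>> 
>>>         // No mapper or reducer class is set, so Hadoop's default
>>>         // (identity) Mapper and Reducer are used.
>>>         job.setJobName("Copy");
>>>         job.setJarByClass(Copy.class);
>>>         job.setMapOutputKeyClass(Text.class);
>>>         job.setMapOutputValueClass(Text.class);
>>> 
>>>         // Write the copied records to args[1] in TeraSort's format.
>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>         job.setOutputFormatClass(TeraOutputFormat.class);
>>>         job.setOutputKeyClass(Text.class);
>>>         job.setOutputValueClass(Text.class);
>>> 
>>>         return job.waitForCompletion(true) ? 0 : 1;
>>>     }
>>> 
>>>     public static void main(String[] args) throws Exception {
>>>         int res = ToolRunner.run(new Configuration(), new Copy(), args);
>>>         System.exit(res);
>>>     }
>>> }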
>> 
>
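
The distribution reported in the quoted message (reducers 0-18 at roughly 10,000 tuples each, reducer 19 at roughly 5,000,000) can be put into numbers with a small sketch. This is plain Java with no Hadoop dependency; the class name ReducerSkew and its methods are hypothetical helpers, and the counts are the approximate figures from the message, not measured values.

```java
import java.util.Arrays;

public class ReducerSkew {

    // Approximate per-reducer tuple counts taken from the message above:
    // reducers 0-18 at ~10,000 each, reducer 19 at ~5,000,000.
    public static long[] reportedCounts() {
        long[] counts = new long[20];
        Arrays.fill(counts, 10_000L);
        counts[19] = 5_000_000L;
        return counts;
    }

    // Total number of tuples across all reducers.
    public static long total(long[] counts) {
        return Arrays.stream(counts).sum();
    }

    // Ratio of the busiest reducer's load to the mean load.
    // A value of 1.0 means the tuples are spread perfectly evenly.
    public static double skew(long[] counts) {
        long max = Arrays.stream(counts).max().orElse(0L);
        double mean = (double) total(counts) / counts.length;
        return max / mean;
    }

    public static void main(String[] args) {
        long[] counts = reportedCounts();
        System.out.printf("total tuples: %d%n", total(counts));
        System.out.printf("even share per reducer: %d%n",
                total(counts) / counts.length);
        System.out.printf("busiest/mean ratio: %.2f%n", skew(counts));
    }
}
```

With an even partitioning each of the 20 reducers would handle about one twentieth of the tuples, so a busiest/mean ratio far above 1 is a quick way to confirm the kind of imbalance described.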
