Hi Arun,

Perfect. Thanks for the help.

Best regards,
Danny

----- Original Message -----
From: Arun C Murthy <a...@yahoo-inc.com>
To: common-user@hadoop.apache.org <common-user@hadoop.apache.org>
Cc: core-u...@hadoop.apache.org <core-u...@hadoop.apache.org>
Sent: Mon Jun 29 14:59:54 2009
Subject: Re: Teragen defaults to 2 maps; terasort defaults to 1 reducer

These are due to the default #maps/#reduces in Map-Reduce.

Use:

$ bin/hadoop jar hadoop-*-dev-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /tera/in
$ bin/hadoop jar hadoop-*-dev-examples.jar terasort -Dmapred.reduce.tasks=5300 /tera/in /tera/out

Arun

On Jun 29, 2009, at 2:03 PM, Gross, Danny wrote:

> Hello all,
>
> I'm trying to run the hadoop-0.19.1-examples.jar teragen and terasort
> programs on a cluster. I have two problems with these programs:
>
> 1. The data is generated in a way that is not balanced across my
>    cluster. This is because the data is generated with 2 maps.
>
>    * With the command "hadoop jar hadoop-0.19.1-examples.jar
>      teragen 1000000000 /terasort" (or any size) per the example doc,
>      I get 2 maps. With replication set to 2, this tends to place data
>      more heavily on 2 of my nodes, and the cluster believes it is
>      balanced.
>
> 2. The terasort program runs out of disk space on the reduce
>    operation. This is because the program runs with a single reduce
>    task.
>
>    * When running "hadoop jar hadoop-0.19.1-examples.jar
>      terasort /terasort /out" per the example doc, I get the
>      appropriate number of maps, but one reduce. I've scoured the web
>      and the new Hadoop book, and I'm just not able to change the
>      number of reducers. An example attempt was with the command
>      "hadoop jar -Dmapred.reduce.tasks=16 hadoop-0.19.1-examples.jar
>      terasort /terasort /out".
>
> Could anyone help shed some light on how to modify the execution of
> these programs to more appropriately balance the data, and spread the
> reduce load out across my cluster?
>
> Best regards,
>
> Danny Gross
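
For reference, the option placement from Arun's reply can be applied to the commands in the original question roughly as follows. This is only a sketch: it prints the command lines rather than running them, EXAMPLES_JAR and build_cmd are hypothetical names used for illustration, and the task count of 16 is an assumption borrowed from the attempt quoted above, not a recommended value.

```shell
#!/bin/sh
# Sketch only: prints the command lines instead of running them.
# EXAMPLES_JAR and build_cmd are hypothetical names for illustration.
EXAMPLES_JAR="hadoop-0.19.1-examples.jar"

# Place the -D property right after the program name (teragen/terasort),
# following the placement shown in Arun's reply.
build_cmd() {
    prog="$1"
    prop="$2"
    shift 2
    echo "hadoop jar $EXAMPLES_JAR $prog -D$prop $*"
}

# 16 tasks is an assumed value taken from the attempt in the question.
TERAGEN_CMD=$(build_cmd teragen mapred.map.tasks=16 1000000000 /terasort)
TERASORT_CMD=$(build_cmd terasort mapred.reduce.tasks=16 /terasort /out)

echo "$TERAGEN_CMD"
echo "$TERASORT_CMD"
```

The attempt in the question put -Dmapred.reduce.tasks=16 between "jar" and the jar file name. The likely difference is that the property needs to follow the program name, as in Arun's commands, so that the example program itself receives it.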