Hi Usman,

So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high
memory jobs so picked 4 only) [9 DN of Dell R300: 2.83GHz quad-core (2x6MB
cache), 8GB RAM and 2x500GB SATA drives]
Using your template for stats, I get the following with no tuning:

GENERATE RANDOM DATA
Wrote out 90GB of random binary data:
Map output records=9198009
The job took 350 seconds. (approximately: 6 minutes)

SORT RANDOM GENERATED DATA
Map output records=9197821
Reduce input records=9197821
The job took 2176 seconds. (approximately: 36 minutes)

So pretty similar to your initial benchmark. I will tune it a bit tomorrow
and rerun. If you spent time tuning your cluster and it was successful,
please can you share your config?

Cheers,
Tim

On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed <[email protected]> wrote:
> Hi Todd,
>
> Some changes have been applied to the cluster based on the documentation
> (URL) you noted below, like the file descriptor settings and
> io.file.buffer.size. I will check out the other settings which I haven't
> applied yet.
>
> Here are my map/reduce slot settings from my hadoop-site.xml and
> hadoop-default.xml on all nodes in the cluster:
>
> hadoop-site.xml:
> mapred.tasktracker.task.maximum = 2
> mapred.tasktracker.map.tasks.maximum = 8
> mapred.tasktracker.reduce.tasks.maximum = 8
>
> hadoop-default.xml:
> mapred.map.tasks = 2
> mapred.reduce.tasks = 1
>
> Thanks,
> Usman
>
>> This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
>> you changed the configurations at all? There are some notes on this
>> blog post that might help your performance a bit:
>>
>> http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
>>
>> How many map and reduce slots did you configure for the daemons? If
>> you have Ganglia installed you can usually get a good idea of whether
>> you're using your resources well by looking at the graphs while
>> running a job like this sort.
>>
>> -Todd
>>
>> On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <[email protected]> wrote:
>>>
>>> Here are the results I got from my 4 node cluster (correction: I noted 5
>>> earlier). One of my nodes out of the 4 is both a namenode and a datanode.
>>>
>>> GENERATE RANDOM DATA
>>> Wrote out 40GB of random binary data:
>>> Map output records=4088301
>>> The job took 358 seconds. (approximately: 6 minutes)
>>>
>>> SORT RANDOM GENERATED DATA
>>> Map output records=4088301
>>> Reduce input records=4088301
>>> The job took 2136 seconds. (approximately: 35 minutes)
>>>
>>> VALIDATION OF SORTED DATA
>>> The job took 183 seconds.
>>> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>>>
>>> It would be interesting to see what performance numbers others with a
>>> similar setup have obtained.
>>>
>>> Thanks,
>>> Usman
>>>
>>>> I am setting up a new cluster of 10 nodes of 2.83GHz quad-core (2x6MB
>>>> cache), 8GB RAM and 2x500GB drives, and will do the same soon. I've got
>>>> some issues though, so it won't start up...
>>>>
>>>> Tim
>>>>
>>>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <[email protected]> wrote:
>>>>>
>>>>> Thanks Tim, I will check it out and post my results for comments.
>>>>> -Usman
>>>>>
>>>>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort
>>>>>> benchmark and posting your results for comment?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there a way to tell what kind of performance numbers one can
>>>>>>> expect out of their cluster given a certain set of specs?
>>>>>>>
>>>>>>> For example, I have 5 nodes in my cluster that all have the following
>>>>>>> hardware configuration:
>>>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks, all on the same rack.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Usman
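[Editor's note: for anyone wanting to reproduce these numbers, the benchmark from the http://wiki.apache.org/hadoop/Sort page boils down to three jobs. This is a sketch: the jar file names and HDFS paths below vary with the Hadoop version and install layout, so adjust them to your cluster:]

```shell
# 1. Generate random binary data (by default roughly 10 maps x 1GB per node;
#    exact defaults depend on the Hadoop version).
hadoop jar hadoop-*-examples.jar randomwriter /benchmarks/rand

# 2. Sort the generated data.
hadoop jar hadoop-*-examples.jar sort /benchmarks/rand /benchmarks/rand-sort

# 3. Validate that the output is a correctly sorted permutation of the input.
hadoop jar hadoop-*-test.jar testmapredsort \
    -sortInput /benchmarks/rand -sortOutput /benchmarks/rand-sort
```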
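[Editor's note: a quick back-of-envelope check of the per-node sort rate mentioned in the thread. This is just arithmetic on the figures quoted above (90GB sorted in 2176 seconds across 9 datanodes), not a measurement:]

```shell
# Sanity-check of the "4-5 MB/sec/node sorting" figure against the 9-DN run:
# 90 GB sorted in 2176 seconds across 9 datanodes.
DATA_MB=$((90 * 1024))   # 90 GB expressed in MB
SECS=2176
NODES=9
awk -v mb="$DATA_MB" -v s="$SECS" -v n="$NODES" \
    'BEGIN { printf "%.1f MB/sec/node\n", mb / s / n }'
# → 4.7 MB/sec/node
```

By the same arithmetic, the 4-node run (40GB in 2136 seconds) comes out at roughly 4.8 MB/sec/node, so both clusters are sorting at essentially the same per-node rate before tuning.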
