Re: Hardware performance from HADOOP cluster

tim robertson Thu, 15 Oct 2009 12:08:23 -0700

Yeah, they are single-proc machines and, other than setting 4 map and
4 reduce slots, it is a completely vanilla 0.20.1 installation.


I will tune it up in the morning based on what I can find on the web
(e.g. the Cloudera guidelines) and post the results.  I am going to be
running HBase on top of this, but want to make sure HDFS/MR is
running soundly before continuing.

Seems there are a few people at the moment setting up clusters - might
it be worth adding our config and results to
http://wiki.apache.org/hadoop/HardwareBenchmarks ?

For people like me (first cluster set up from scratch - previously
used the EC2 scripts) it is nice to sanity check that things look about
right.  The mailing lists suggest there are a few small clusters of
medium spec machines springing up.
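
As a rough sanity check on the figures quoted further down this thread,
per-node sort throughput can be estimated with a few lines of Python.
This is only a sketch under stated assumptions: 1 GB is taken as 1024 MB,
the work is assumed to be spread evenly over the nodes, all 7 nodes of
the Nehalem cluster are assumed to run a DataNode, and the helper name
mb_per_sec_per_node is made up for illustration.

```python
# Rough per-node sort throughput from the figures quoted in this thread.
# Assumptions (not stated in the original posts): 1 GB = 1024 MB, and the
# sort work is spread evenly over the nodes listed for each cluster.

def mb_per_sec_per_node(data_gb, seconds, nodes):
    """Return sort throughput in MB/sec per node (hypothetical helper)."""
    return data_gb * 1024.0 / seconds / nodes

# (label, data sorted in GB, sort time in seconds, node count)
runs = [
    ("Usman, 4 nodes", 40, 2136, 4),
    ("Tim, 9 DN", 90, 2176, 9),
    ("Patrick, 7 nodes", 70, 649, 7),  # assumes all 7 run a DataNode
]

for label, data_gb, seconds, nodes in runs:
    rate = mb_per_sec_per_node(data_gb, seconds, nodes)
    print("%-18s %5.1f MB/sec/node" % (label, rate))
```

Under these assumptions the two untuned clusters come out at about 4.8
and 4.7 MB/sec/node, which matches the 4-5 MB/sec/node range Todd
mentions, and the dual quad-core Nehalem cluster at roughly three times
that.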

Cheers,
Tim




On Thu, Oct 15, 2009 at 5:52 PM, Patrick Angeles
<[email protected]> wrote:
> Hi Tim,
> I assume those are single proc machines?
>
> I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
> have dual quad Nehalems (2.26GHz).
>
> On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
> <[email protected]> wrote:
>
>> Hi Usman,
>>
>> So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
>> high memory jobs so picked 4 only)
>> [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA
>> drives]
>>
>> Using your template for stats, I get the following with no tuning:
>>
>> GENERATE RANDOM DATA
>> Wrote out 90GB of random binary data:
>> Map output records=9198009
>> The job took 350 seconds. (approximately: 6 minutes)
>>
>> SORT RANDOM GENERATED DATA
>> Map output records= 9197821
>> Reduce input records=9197821
>> The job took 2176 seconds. (approximately: 36mins).
>>
>> So pretty similar to your initial benchmark.  I will tune it a bit
>> tomorrow and rerun.
>>
>> If you spent time tuning your cluster and it was successful, please
>> can you share your config?
>>
>> Cheers,
>> Tim
>>
>>
>>
>>
>>
>> On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed <[email protected]> wrote:
>> > Hi Todd,
>> >
>> > Some changes have been applied to the cluster based on the documentation
>> > (URL) you noted below, like file descriptor settings and
>> > io.file.buffer.size. I will check out the other settings which I haven't
>> > applied yet.
>> >
>> > My map/reduce slot settings from my hadoop-site.xml and
>> > hadoop-default.xml on all nodes in the cluster:
>> >
>> > hadoop-site.xml
>> > mapred.tasktracker.task.maximum = 2
>> > mapred.tasktracker.map.tasks.maximum = 8
>> > mapred.tasktracker.reduce.tasks.maximum = 8
>> >
>> > hadoop-default.xml
>> > mapred.map.tasks = 2
>> > mapred.reduce.tasks = 1
>> >
>> > Thanks,
>> > Usman
>> >
>> >
>> >> This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
>> >> you changed the configurations at all? There are some notes on this
>> >> blog post that might help your performance a bit:
>> >>
>> >>
>> >>
>> >> http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
>> >>
>> >> How many map and reduce slots did you configure for the daemons? If
>> >> you have Ganglia installed you can usually get a good idea of whether
>> >> you're using your resources well by looking at the graphs while
>> >> running a job like this sort.
>> >>
>> >> -Todd
>> >>
>> >> On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <[email protected]> wrote:
>> >>
>> >>>
>> >>> Here are the results I got from my 4 node cluster (correction: I noted
>> >>> 5 earlier). One of the 4 nodes is both a namenode and a datanode.
>> >>>
>> >>> GENERATE RANDOM DATA
>> >>> Wrote out 40GB of random binary data:
>> >>> Map output records=4088301
>> >>> The job took 358 seconds. (approximately: 6 minutes).
>> >>>
>> >>> SORT RANDOM GENERATED DATA
>> >>> Map output records=4088301
>> >>> Reduce input records=4088301
>> >>> The job took 2136 seconds. (approximately: 35 minutes).
>> >>>
>> >>> VALIDATION OF SORTED DATA
>> >>> The job took 183 seconds.
>> >>> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>> >>>
>> >>> It would be interesting to see what performance numbers others with a
>> >>> similar setup have obtained.
>> >>>
>> >>> Thanks,
>> >>> Usman
>> >>>
>> >>>
>> >>>>
>> >>>> I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
>> >>>> cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
>> >>>> issues though so it won't start up...
>> >>>>
>> >>>> Tim
>> >>>>
>> >>>>
>> >>>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <[email protected]> wrote:
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Thanks Tim, I will check it out and post my results for comments.
>> >>>>> -Usman
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> Might it be worth running http://wiki.apache.org/hadoop/Sort and
>> >>>>>> posting your results for comment?
>> >>>>>>
>> >>>>>> Tim
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <[email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> Is there a way to tell what kind of performance numbers one can
>> >>>>>>> expect out of their cluster given a certain set of specs?
>> >>>>>>>
>> >>>>>>> For example, I have 5 nodes in my cluster that all have the
>> >>>>>>> following hardware configuration:
>> >>>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks, and are all on the same rack.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Usman
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>>
>
