Hi Usman,

So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high
memory jobs so picked 4 only) [9 DN of Dell R300: 2.83GHz quad-core (2x6MB
cache), 8GB RAM and 2x500GB SATA drives]
Using your template for stats, I get the following with no tuning:

GENERATE RANDOM DATA
Wrote out 90GB of random binary data:
Map output records=9198009
The job took 350 seconds. (approximately: 6 minutes)

SORT RANDOM GENERATED DATA
Map output records=9197821
Reduce input records=9197821
The job took 2176 seconds. (approximately: 36 minutes)

So pretty similar to your initial benchmark. I will tune it a bit tomorrow
and rerun. If you spent time tuning your cluster and it was successful,
please can you share your config?

Cheers,
Tim

On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed <[email protected]> wrote:
> Hi Todd,
>
> Some changes have been applied to the cluster based on the documentation
> (URL) you noted below, like the file descriptor settings and
> io.file.buffer.size. I will check out the other settings which I haven't
> applied yet.
>
> Here are my map/reduce slot settings from my hadoop-site.xml and
> hadoop-default.xml on all nodes in the cluster:
>
> hadoop-site.xml:
> mapred.tasktracker.task.maximum = 2
> mapred.tasktracker.map.tasks.maximum = 8
> mapred.tasktracker.reduce.tasks.maximum = 8
>
> hadoop-default.xml:
> mapred.map.tasks = 2
> mapred.reduce.tasks = 1
>
> Thanks,
> Usman
>
>> This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
>> you changed the configurations at all? There are some notes on this
>> blog post that might help your performance a bit:
>>
>> http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
>>
>> How many map and reduce slots did you configure for the daemons? If
>> you have Ganglia installed you can usually get a good idea of whether
>> you're using your resources well by looking at the graphs while
>> running a job like this sort.
>>
>> -Todd
>>
>> On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <[email protected]> wrote:
>>>
>>> Here are the results I got from my 4 node cluster (correction: I noted 5
>>> earlier). One of my nodes out of the 4 is both a namenode and a datanode.
>>>
>>> GENERATE RANDOM DATA
>>> Wrote out 40GB of random binary data:
>>> Map output records=4088301
>>> The job took 358 seconds. (approximately: 6 minutes)
>>>
>>> SORT RANDOM GENERATED DATA
>>> Map output records=4088301
>>> Reduce input records=4088301
>>> The job took 2136 seconds. (approximately: 35 minutes)
>>>
>>> VALIDATION OF SORTED DATA
>>> The job took 183 seconds.
>>> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>>>
>>> It would be interesting to see what performance numbers others with a
>>> similar setup have obtained.
>>>
>>> Thanks,
>>> Usman
>>>
>>>> I am setting up a new cluster of 10 nodes of 2.83GHz quad-core (2x6MB
>>>> cache), 8GB RAM and 2x500GB drives, and will do the same soon. I've got
>>>> some issues though, so it won't start up...
>>>>
>>>> Tim
>>>>
>>>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <[email protected]> wrote:
>>>>>
>>>>> Thanks Tim, I will check it out and post my results for comments.
>>>>> -Usman
>>>>>
>>>>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort
>>>>>> benchmark and posting your results for comment?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there a way to tell what kind of performance numbers one can
>>>>>>> expect out of their cluster given a certain set of specs?
>>>>>>>
>>>>>>> For example, I have 5 nodes in my cluster that all have the following
>>>>>>> hardware configuration:
>>>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks, all on the same rack.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Usman
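[Editor's note: for anyone wanting to reproduce these numbers, the benchmark from the http://wiki.apache.org/hadoop/Sort page boils down to three jobs. This is a sketch: the jar file names and HDFS paths below vary with the Hadoop version and install layout, so adjust them to your cluster:]

```shell
# 1. Generate random binary data (by default roughly 10 maps x 1GB per node;
#    exact defaults depend on the Hadoop version).
hadoop jar hadoop-*-examples.jar randomwriter /benchmarks/rand

# 2. Sort the generated data.
hadoop jar hadoop-*-examples.jar sort /benchmarks/rand /benchmarks/rand-sort

# 3. Validate that the output is a correctly sorted permutation of the input.
hadoop jar hadoop-*-test.jar testmapredsort \
    -sortInput /benchmarks/rand -sortOutput /benchmarks/rand-sort
```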
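[Editor's note: a quick back-of-envelope check of the per-node sort rate mentioned in the thread. This is just arithmetic on the figures quoted above (90GB sorted in 2176 seconds across 9 datanodes), not a measurement:]

```shell
# Sanity-check of the "4-5 MB/sec/node sorting" figure against the 9-DN run:
# 90 GB sorted in 2176 seconds across 9 datanodes.
DATA_MB=$((90 * 1024))   # 90 GB expressed in MB
SECS=2176
NODES=9
awk -v mb="$DATA_MB" -v s="$SECS" -v n="$NODES" \
    'BEGIN { printf "%.1f MB/sec/node\n", mb / s / n }'
# → 4.7 MB/sec/node
```

By the same arithmetic, the 4-node run (40GB in 2136 seconds) comes out at roughly 4.8 MB/sec/node, so both clusters are sorting at essentially the same per-node rate before tuning.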
