Hi all,

I've added the following to core-site.xml, mapred-site.xml and
hdfs-site.xml (based on the Cloudera guidelines:
http://tinyurl.com/ykupczu):

  io.sort.factor: 15 (mapred-site.xml)
  io.sort.mb: 150 (mapred-site.xml)
  io.file.buffer.size: 65536 (core-site.xml)
  dfs.datanode.handler.count: 3 (hdfs-site.xml - actually the default)
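In case it helps, this is roughly what the entries look like in
mapred-site.xml (just a sketch of the two sort-related properties; the
other files use the same <property> format):

  <property>
    <name>io.sort.factor</name>
    <value>15</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>150</value>
  </property>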
I'm using the default HADOOP_HEAPSIZE=1000 (hadoop-env.sh) and running
2 mappers and 2 reducers per node.

Can someone please help me with the maths as to why my jobs are failing
with "Error: Java heap space" in the maps? (The same job runs fine with
io.sort.factor of 10 and io.sort.mb of 100.)

  io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
  Plus the 2 daemons on the node at 1G each = 1.8G
  Plus Xmx of 1G for each hadoop daemon task = 5.8G

The machines have 8G in them. Obviously my maths is screwy somewhere...

Cheers,
Tim

On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <[email protected]> wrote:
> On Thu, 15 Oct 2009 11:32:35 +0200
> Usman Waheed <[email protected]> wrote:
>
>> Hi Todd,
>>
>> Some changes have been applied to the cluster based on the
>> documentation (URL) you noted below,
>
> I would also like to know what settings people are tuning on the
> operating system level. The blog post mentioned here does not say
> much about that, except for the fileno changes.
>
> We got about 3x the read performance when running DFSIOTest by
> mounting our ext3 filesystems with the noatime parameter. I saw that
> mentioned in the slides from a Cloudera presentation.
>
> (For those who don't know, the noatime parameter turns off the
> recording of access times on files. Recording them is a horrible
> performance killer, since it means every read of a file also requires
> the kernel to do a write. These writes are probably queued up, but
> still: if you don't need the atime (very few applications do), turn
> it off!)
>
> Have people been experimenting with different filesystems, or are
> most of us running on top of ext3?
>
> How about mounting ext3 with "data=writeback"? That's rumoured to
> give the best throughput and could help with write performance. From
> mount(8):
>
>   writeback
>     Data ordering is not preserved - data may be written into the
>     main file system after its metadata has been committed to the
>     journal. This is rumoured to be the highest throughput option.
>     It guarantees internal file system integrity, however it can
>     allow old data to appear in files after a crash and journal
>     recovery.
>
> How would the HDFS consistency checks cope with old data appearing in
> the underlying files after a system crash?
>
> Cheers,
> \EF
> --
> Erik Forsberg <[email protected]>
> Developer, Opera Software - http://www.opera.com/
>
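P.S. For anyone wanting to try what Erik describes: I think the fstab
entry would look something like the line below (device and mount point
are just examples, not from our cluster):

  # example /etc/fstab line for a DataNode data disk
  /dev/sdb1   /data/1   ext3   defaults,noatime,data=writeback   0   0

noatime on its own can also be turned on for a mounted filesystem with
"mount -o remount,noatime /data/1", but as far as I know the data=
journalling mode can't be changed on a remount, so that one has to be
in place when the filesystem is first mounted.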
