> From my experience, hadoop loathes swap and you mention that all reduces and 
> mappers are running (8 total) and from the ganglia screenshot I see that you 
> have a thick crest of that purple swap.

I know, it's ugly isn't it :) My understanding is that this is partly due to 
forked processes though.

> If we do the math that means [ map.tasks.max * mapred.child.java.opts ]  +  [ 
> reduce.tasks.max * mapred.child.java.opts ] => or [ 4 * 2.5G ] + [ 4 * 2.5G ] 
> is greater than the amount of physical RAM in the machine.
> This doesn't account for the base tasktracker and datanode process + OS 
> overhead and whatever else may be hoarding resources on the systems.

This makes me feel stupid :) Your right, I've just screwed it down, we'll see 
how it performs now.

> I would play with this ratio, either less maps / reduces max - or lower your 
> child.java.opts so that when you are fully subscribed you are not using more 
> resource than the machine can offer.
> Also, setting mapred.reduce.slowstart.completed.maps   to 1.00 or some other 
> value close to 1 would be one way to guarantee only 4 either maps or reduces 
> to be running at once and address (albeit in a duct tape like way) the 
> oversubscription problem you are seeing (this represents the fractions of 
> maps that should complete before initiating the reduce phase).

This is a new one for me. I get Allen's point that on a multi tenant cluster 
this won't fix the problem, but the default is definitely not a good one. 
Starting reduce tasks as soon as map tasks start running is hardly ever useful, 
and just takes up slots that could be used by others.

Thanks a bunch for the suggestions!

Cheers,
Evert



On Wed, May 11, 2011 at 3:23 AM, Evert Lammerts <[email protected]> wrote:
Hi list,

I notice that whenever our Hadoop installation is put under a heavy load we 
lose one or two (on a total of five) datanodes. This results in IOExceptions, 
and affects the overall performance of the job being run. Can anybody give me 
advise or best practices on a different configuration to increase the 
stability? Below I've included the specs of the cluster, the hadoop related 
config and an example of when which things go wrong. Any help is very much 
appreciated, and if I can provide any other info please let me know.

Cheers,
Evert

== What goes wrong, and when ==

See attached a screenshot of Ganglia when the cluster is under load of a single 
job. This job:
* reads ~1TB from HDFS
* writes ~200GB to HDFS
* runs 288 Mappers and 35 Reducers

When the job runs it takes all available Map and Reduce slots. The system 
starts swapping and there is a short time interval during which most cores are 
in WAIT. After that the job really starts running. At around half way, one or 
two datanodes become unreachable and are marked as dead nodes. The amount of 
under-replicated blocks becomes huge. Then some "java.io.IOException: Could not 
obtain block" are thrown in Mappers. The job does manage to finish successfully 
after around 3.5 hours, but my fear is that when we make the input much larger 
- which we want - the system becomes too unstable to finish the job.

Maybe worth mentioning - never know what might help diagnostics.  We notice 
that memory usage becomes less when we switch our keys from Text to 
LongWritable. Also, the Mappers are done in a fraction of the time. However, 
this for some reason results in much more network traffic and makes Reducers 
extremely slow. We're working on figuring out what causes this.


== The cluster ==

We have a cluster that consists of 6 Sun Thumpers running Hadoop 0.20.2 on 
CentOS 5.5. One of them acts as NN and JT, the other 5 run DN's and TT's. Each 
node has:
* 16GB RAM
* 32GB swapspace
* 4 cores
* 11 LVM's of 4 x 500GB disks (2TB in total) for HDFS
* non-HDFS stuff on separate disks
* a 2x1GE bonded network interface for interconnects
* a 2x1GE bonded network interface for external access

I realize that this is not a well balanced system, but it's what we had 
available for a prototype environment. We're working on putting together a 
specification for a much larger production environment.


== Hadoop config ==

Here some properties that I think might be relevant:

__CORE-SITE.XML__

fs.inmemory.size.mb: 200
mapreduce.task.io.sort.factor: 100
mapreduce.task.io.sort.mb: 200
# 1024*1024*4 MB, blocksize of the LVM's
io.file.buffer.size: 4194304

__HDFS-SITE.XML__

# 1024*1024*4*32 MB, 32 times the blocksize of the LVM's
dfs.block.size: 134217728
# Only 5 DN's, but this shouldn't hurt
dfs.namenode.handler.count: 40
# This got rid of the occasional "Could not obtain block"'s
dfs.datanode.max.xcievers: 4096

__MAPRED-SITE.XML__

mapred.tasktracker.map.tasks.maximum: 4
mapred.tasktracker.reduce.tasks.maximum: 4
mapred.child.java.opts: -Xmx2560m
mapreduce.reduce.shuffle.parallelcopies: 20
mapreduce.map.java.opts: -Xmx512m
mapreduce.reduce.java.opts: -Xmx512m
# Compression codecs are configured and seem to work fine
mapred.compress.map.output: true
mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec

Reply via email to