After making two changes to my Hadoop cluster I now see Fetch Failures
that I previously did not see when running the same MR programs. The
two changes were:
a) moving from Hadoop 0.20.2 to 0.21.0, and
b) downsizing my cluster from 39 slave nodes to 5 slave nodes (plus 1
dedicated master node).
The failure mechanism seems to be consistent on many different MR
programs: Map reaches 100% and Reduce reaches 32-33% and then the errors
are reported. Any ideas on how I can fix this problem? Thanks.
For the record, each node has 2 quad-core Nehalems with 24 or 48 GB RAM
and 4 1TB hard drives.
10/10/28 08:51:11 INFO mapred.FileInputFormat: Total input paths to
process : 47
10/10/28 08:51:11 WARN conf.Configuration: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps
10/10/28 08:51:11 INFO mapreduce.JobSubmitter: number of splits:54
10/10/28 08:51:11 INFO mapreduce.JobSubmitter: adding the following
namenodes' delegation tokens:null
10/10/28 08:51:12 INFO mapreduce.Job: Running job: job_201010280812_0004
10/10/28 08:51:13 INFO mapreduce.Job: map 0% reduce 0%
10/10/28 08:51:20 INFO mapreduce.Job: map 20% reduce 0%
10/10/28 08:51:21 INFO mapreduce.Job: map 31% reduce 0%
10/10/28 08:51:22 INFO mapreduce.Job: map 64% reduce 0%
10/10/28 08:51:23 INFO mapreduce.Job: map 85% reduce 0%
10/10/28 08:51:24 INFO mapreduce.Job: map 94% reduce 0%
10/10/28 08:51:25 INFO mapreduce.Job: map 96% reduce 0%
10/10/28 08:51:27 INFO mapreduce.Job: map 100% reduce 0%
10/10/28 08:51:29 INFO mapreduce.Job: map 100% reduce 28%
10/10/28 08:51:32 INFO mapreduce.Job: map 100% reduce 32%
10/10/28 08:52:04 INFO mapreduce.Job: Task Id :
attempt_201010280812_0004_m_000000_0, Status : FAILED
Too many fetch-failures
10/10/28 08:52:04 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:04 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:37 INFO mapreduce.Job: Task Id :
attempt_201010280812_0004_m_000001_0, Status : FAILED
Too many fetch-failures
10/10/28 08:52:37 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:37 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:50 INFO mapreduce.Job: map 100% reduce 100%
10/10/28 08:52:52 INFO mapreduce.Job: Job complete:
job_201010280812_0004
10/10/28 08:52:52 INFO mapreduce.Job: Counters: 34
FileInputFormatCounters
BYTES_READ=9370328
FileSystemCounters
FILE_BYTES_READ=1670448
FILE_BYTES_WRITTEN=3342942
HDFS_BYTES_READ=9766272
HDFS_BYTES_WRITTEN=670
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=5
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Job Counters
Data-local map tasks=42
Total time spent by all maps waiting after reserving
slots (ms)=0
Total time spent by all reduces waiting after reserving
slots (ms)=0
Rack-local map tasks=14
SLOTS_MILLIS_MAPS=160676
SLOTS_MILLIS_REDUCES=88121
Launched map tasks=56
Launched reduce tasks=2
Map-Reduce Framework
Combine input records=193132
Combine output records=54
Failed Shuffles=5
GC time elapsed (ms)=2293
Map input records=258047
Map output bytes=4426979
Map output records=193132
Merged Map outputs=54
Reduce input groups=47
Reduce input records=54
Reduce output records=8
Reduce shuffle bytes=1670766
Shuffled Maps =54
Spilled Records=108
SPLIT_RAW_BYTES=5540
...
Task Logs: 'attempt_201010280812_0004_m_000000_1'
stdout logs
stderr logs
syslog logs
2010-10-28 04:58:18,855 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2010-10-28 04:58:18,861 WARN org.apache.hadoop.conf.Configuration:
user.name is deprecated. Instead, use mapreduce.job.user.name
2010-10-28 04:58:18,923 WARN org.apache.hadoop.conf.Configuration:
mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2010-10-28 04:58:18,991 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 1
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR)
0 kvi 65535996(262143984)
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask:
mapreduce.task.io.sort.mb: 250
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: soft
limit at 209715200
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 0; bufvoid = 262144000
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: kvstart =
65535996; length = 16384000
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: Starting
flush of map output
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: Spilling
map output
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 0; bufend = 165350; bufvoid = 262144000
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: kvstart =
65535996(262143984); kvend = 65506324(262025296); length =
29673/16384000
2010-10-28 04:58:19,768 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 0
2010-10-28 04:58:19,772 INFO org.apache.hadoop.mapred.Task:
Task:attempt_201010280812_0004_m_000000_1 is done. And is in the process
of commiting
2010-10-28 04:58:19,780 INFO org.apache.hadoop.mapred.Task: Task
'attempt_201010280812_0004_m_000000_1' done.