After making two changes to my Hadoop cluster, I now see fetch
failures that I did not see before when running the same MR programs.
The two changes were:
a) upgrading from Hadoop 0.20.2 to 0.21.0, and
b) downsizing the cluster from 39 slave nodes to 5 slave nodes (plus
1 dedicated master node).
The failure mode is consistent across many different MR programs: the
map phase reaches 100%, the reduce phase reaches 32-33%, and then the
errors below are reported.  Any ideas on how I can fix this problem?
Thanks.

For the record, each node has two quad-core Nehalem CPUs, 24 or 48 GB
of RAM, and four 1 TB hard drives.
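In case it helps with diagnosis: "Too many fetch-failures" together
with "Connection refused" often comes down to hostname resolution or
an unreachable TaskTracker shuffle port between nodes.  Below is a
minimal sketch of such a probe; the slave host names are placeholders
and 50060 is assumed to be the default TaskTracker HTTP (shuffle)
port, so adjust both for your cluster:

```python
# Probe each slave for the usual fetch-failure suspects: does the
# host name resolve, and is the TaskTracker shuffle port reachable?
# "slave1".."slave5" and port 50060 are placeholder assumptions.
import socket

def probe(host, port=50060, timeout=2.0):
    """Return (resolves, port_open) for one node."""
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        # name does not resolve at all
        return (False, False)
    try:
        # reduces fetch map output over HTTP from this port
        with socket.create_connection((addr, port), timeout=timeout):
            return (True, True)
    except OSError:
        # resolves, but connection refused / timed out
        return (True, False)

if __name__ == "__main__":
    for host in ["slave1", "slave2", "slave3", "slave4", "slave5"]:
        resolves, port_open = probe(host)
        print(host,
              "resolves" if resolves else "DOES NOT RESOLVE",
              "shuffle port open" if port_open else "port unreachable")
```

If any slave fails either check from any other node, that would
explain reduces stalling around 32-33% while they retry fetches.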


10/10/28 08:51:11 INFO mapred.FileInputFormat: Total input paths to
process : 47
10/10/28 08:51:11 WARN conf.Configuration: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps
10/10/28 08:51:11 INFO mapreduce.JobSubmitter: number of splits:54
10/10/28 08:51:11 INFO mapreduce.JobSubmitter: adding the following
namenodes' delegation tokens:null
10/10/28 08:51:12 INFO mapreduce.Job: Running job: job_201010280812_0004
10/10/28 08:51:13 INFO mapreduce.Job:  map 0% reduce 0%
10/10/28 08:51:20 INFO mapreduce.Job:  map 20% reduce 0%
10/10/28 08:51:21 INFO mapreduce.Job:  map 31% reduce 0%
10/10/28 08:51:22 INFO mapreduce.Job:  map 64% reduce 0%
10/10/28 08:51:23 INFO mapreduce.Job:  map 85% reduce 0%
10/10/28 08:51:24 INFO mapreduce.Job:  map 94% reduce 0%
10/10/28 08:51:25 INFO mapreduce.Job:  map 96% reduce 0%
10/10/28 08:51:27 INFO mapreduce.Job:  map 100% reduce 0%
10/10/28 08:51:29 INFO mapreduce.Job:  map 100% reduce 28%
10/10/28 08:51:32 INFO mapreduce.Job:  map 100% reduce 32%
10/10/28 08:52:04 INFO mapreduce.Job: Task Id :
attempt_201010280812_0004_m_000000_0, Status : FAILED
Too many fetch-failures
10/10/28 08:52:04 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:04 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:37 INFO mapreduce.Job: Task Id :
attempt_201010280812_0004_m_000001_0, Status : FAILED
Too many fetch-failures
10/10/28 08:52:37 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:37 WARN mapreduce.Job: Error reading task
outputConnection refused
10/10/28 08:52:50 INFO mapreduce.Job:  map 100% reduce 100%
10/10/28 08:52:52 INFO mapreduce.Job: Job complete:
job_201010280812_0004
10/10/28 08:52:52 INFO mapreduce.Job: Counters: 34
        FileInputFormatCounters
                BYTES_READ=9370328
        FileSystemCounters
                FILE_BYTES_READ=1670448
                FILE_BYTES_WRITTEN=3342942
                HDFS_BYTES_READ=9766272
                HDFS_BYTES_WRITTEN=670
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=5
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        Job Counters 
                Data-local map tasks=42
                Total time spent by all maps waiting after reserving
slots (ms)=0
                Total time spent by all reduces waiting after reserving
slots (ms)=0
                Rack-local map tasks=14
                SLOTS_MILLIS_MAPS=160676
                SLOTS_MILLIS_REDUCES=88121
                Launched map tasks=56
                Launched reduce tasks=2
        Map-Reduce Framework
                Combine input records=193132
                Combine output records=54
                Failed Shuffles=5
                GC time elapsed (ms)=2293
                Map input records=258047
                Map output bytes=4426979
                Map output records=193132
                Merged Map outputs=54
                Reduce input groups=47
                Reduce input records=54
                Reduce output records=8
                Reduce shuffle bytes=1670766
                Shuffled Maps =54
                Spilled Records=108
                SPLIT_RAW_BYTES=5540

...

Task Logs: 'attempt_201010280812_0004_m_000000_1'

stdout logs

stderr logs

syslog logs
2010-10-28 04:58:18,855 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2010-10-28 04:58:18,861 WARN org.apache.hadoop.conf.Configuration:
user.name is deprecated. Instead, use mapreduce.job.user.name
2010-10-28 04:58:18,923 WARN org.apache.hadoop.conf.Configuration:
mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2010-10-28 04:58:18,991 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 1
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR)
0 kvi 65535996(262143984)
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask:
mapreduce.task.io.sort.mb: 250
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: soft
limit at 209715200
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 0; bufvoid = 262144000
2010-10-28 04:58:19,132 INFO org.apache.hadoop.mapred.MapTask: kvstart =
65535996; length = 16384000
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: Starting
flush of map output
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: Spilling
map output
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 0; bufend = 165350; bufvoid = 262144000
2010-10-28 04:58:19,373 INFO org.apache.hadoop.mapred.MapTask: kvstart =
65535996(262143984); kvend = 65506324(262025296); length =
29673/16384000
2010-10-28 04:58:19,768 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 0
2010-10-28 04:58:19,772 INFO org.apache.hadoop.mapred.Task:
Task:attempt_201010280812_0004_m_000000_1 is done. And is in the process
of commiting
2010-10-28 04:58:19,780 INFO org.apache.hadoop.mapred.Task: Task
'attempt_201010280812_0004_m_000000_1' done.
