Hi all,
This is my first email to the list, so feel free to be candid in your
complaints if I'm doing something canonically uncouth in my request for
assistance.
I'm using Hadoop 0.23 on 50 machines, each connected via gigabit
Ethernet and each with a single hard disk. I am getting the following
error repeatably for the TeraSort benchmark. TeraGen runs without
error, and TeraSort proceeds normally until this error appears somewhere
between 64% and 70% completion. It doesn't happen on every execution:
roughly one run in four finishes successfully (TeraValidate included).
Error at the CLI:
"12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004337_0, Status : FAILED
Container killed by the ApplicationMaster.
Too Many fetch failures.Failing the attempt
12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read
timed out
12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read
timed out
12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004613_0, Status : FAILED"
I am still warming up to YARN, so I'm not yet adept at gathering all the
logfiles I need. From a closer look at the logs I could find, and at the
machines themselves, the problem appears to be related to the number of
sockets open concurrently: at some point no further connections can be
made from the requesting reducer to the mapper holding the desired data,
so the reducer concludes that fetching the data has failed. These errors
keep appearing roughly every 3 minutes for about 45 minutes, until the
job finally dies.
I have attached my *-site.xml files to give a clearer picture of my
configuration; any and all suggestions or requests for more information
are welcome. Things I have already tried, per the troubleshooting guide at
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
- mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it
hurts performance since I'm the only user on the cluster, and it doesn't
cure the problem -- it just raises the chance of completion from about
1/4 to 1/3 at best)
- tasktracker.http.threads = 80 (the default is 40, I think; I've tried
this and much higher values to no avail)
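For reference, this is roughly how I applied those two settings in
mapred-site.xml -- a sketch rather than a paste of my exact file, using
the property names from that guide (I'm not sure the tasktracker one is
even honoured under YARN):
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>
</property>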
Best, and Thanks in Advance,
ellis
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
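<!-- core-site.xml -->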
<configuration>
  <!-- PROPERTIES FOR ALL TYPES -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <final>true</final>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>256</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/local/hadoop/tmp</value>
    <final>true</final>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>20</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.local.block.size</name>
    <value>33554432</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://pool103:9000</value>
    <final>true</final>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
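<!-- hdfs-site.xml -->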
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/local/hadoop/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/local/hadoop/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>20</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
    <final>true</final>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
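<!-- mapred-site.xml -->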
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!--
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>1536</value>
    <final>true</final>
  </property>
  -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx512M</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx512M</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
    <final>true</final>
  </property>
</configuration>
<?xml version="1.0"?>
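<!-- yarn-site.xml -->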
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/local/hadoop/nmlocal</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>pool103:8025</value>
    <description>host is the hostname of the resource manager and
    port is the port on which the NodeManagers contact the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>pool103:8030</value>
    <description>host is the hostname of the resourcemanager and port is the port
    on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    <description>In case you do not want to use the default scheduler</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>pool103</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on
    which the clients can talk to the Resource Manager.</description>
  </property>
</configuration>