Mat,

Perhaps you can simply set a percentage of failure tolerance for your job. This is doable via
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
and
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)

If you set it to 10%, the job still succeeds as long as no more than 10% of the map (or reduce) tasks fail. I think this fits your use case.
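Something along these lines, as a rough sketch (untested, against the 0.20.2 "old" mapred API; the driver class name, job name and the rest of the job setup are placeholders for whatever your existing driver already does):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class ArcJobDriver {                        // placeholder driver class
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ArcJobDriver.class);
      conf.setJobName("arc-processing");             // placeholder job name
      // ... input/output paths, mapper class, input format etc. as in your current job ...

      // Let up to 10% of map tasks fail without failing the whole job.
      conf.setMaxMapTaskFailuresPercent(10);
      // Same knob for reduce tasks; irrelevant for a map-only job, included for completeness.
      conf.setMaxReduceTaskFailuresPercent(10);

      JobClient.runJob(conf);
    }
  }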
On 04-Dec-2011, at 4:05 AM, Mat Kelcey wrote:

> Hi folks,
>
> I have a Hadoop 0.20.2 map-only job with thousands of input tasks;
> I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
> so each task corresponds to a single file in HDFS.
>
> Most of the way into the job it hits a task that causes the input
> format to OOM. After 4 attempts it fails the job.
> Now this is obviously not great, but for the purposes of my job I'd be
> happy to just throw this input file away; it's only one of thousands
> and I don't need exact results.
>
> The trouble is I can't work out what file this task corresponds to.
>
> The closest I can find is that the job history file lists a STATE_STRING
> (e.g.
> STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"
> )
>
> but this is _only_ for the successfully completed tasks; for the failed
> one I'm actually interested in there is nothing:
>
> MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
> TASK_ATTEMPT_ID="attempt_201112030459_0011_m_004130_0"
> TASK_STATUS="FAILED" FINISH_TIME="1322901661261"
> HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .
>
> I grepped through all the Hadoop logs and couldn't find anything that
> relates this task to the files in its split.
> Any ideas where this info might be recorded?
>
> Cheers,
> Mat