Good idea, I guess I'll just have to run it up again though. Thanks
On 4 December 2011 01:12, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> Hi Mat
> I'm not sure of an implicit mechanism in Hadoop that logs the input
> splits (file names) each mapper is processing. To analyze that you may
> have to do some custom logging: just log the input file name at the
> start of the map method. The full file path in HDFS can be obtained
> from the InputSplit as follows:
>
> // get the file split being processed
> FileSplit filsp = (FileSplit) context.getInputSplit();
> // get the full path of the file being processed
> log.debug(filsp.getPath());
>
> This works with the new MapReduce API. In the old MapReduce API you can
> get the information from the JobConf as
> job.get("map.input.file");
> You can include this line of code in your configure method in case of
> the old API.
>
> Hope it helps!
>
> Regards
> Bejoy.K.S
>
>
> On Sun, Dec 4, 2011 at 4:05 AM, Mat Kelcey <matthew.kel...@gmail.com> wrote:
>>
>> Hi folks,
>>
>> I have a Hadoop 0.20.2 map-only job with thousands of input tasks;
>> I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
>> so each task corresponds to a single file in HDFS.
>>
>> Most of the way into the job it hits a task that causes the input
>> format to OOM. After 4 attempts it fails the job.
>> Now this is obviously not great, but for the purposes of my job I'd
>> be happy to just throw this input file away; it's only one of
>> thousands and I don't need exact results.
>>
>> The trouble is I can't work out which file this task corresponds to.
>>
>> The closest I can find is that the job history file lists a STATE_STRING,
>> e.g.
>> STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"
>>
>> but this is _only_ for the successfully completed ones; for the failed
>> one I'm actually interested in, there is nothing:
>> MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
>> TASK_ATTEMPT_ID="attempt_201112030459_0011_m_004130_0"
>> TASK_STATUS="FAILED" FINISH_TIME="1322901661261"
>> HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .
>>
>> I grepped through all the Hadoop logs and couldn't find anything that
>> relates this task to the files in its split.
>> Any ideas where this info might be recorded?
>>
>> Cheers,
>> Mat
>
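
For reference, a minimal self-contained sketch of the old-API variant Bejoy
describes, since ArcInputFormat on 0.20.2 is an old-API input format. The
class name, the Text output types, and the commons-logging logger are my own
assumptions, not from the thread, as are the Text/BytesWritable input types
(adjust to whatever your input format actually emits):

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InputLoggingMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  private static final Log LOG = LogFactory.getLog(InputLoggingMapper.class);

  @Override
  public void configure(JobConf job) {
    // For file-based splits the old API sets "map.input.file" to the
    // path of the file backing this task's split. Logging it at INFO
    // keeps the path in the task attempt's log even if the task later
    // dies with an OOM.
    LOG.info("processing input file: " + job.get("map.input.file"));
  }

  public void map(Text key, BytesWritable value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... actual map logic ...
  }
}

With this in place, the attempt log for a failed attempt (such as
attempt_201112030459_0011_m_004130_0 above) should record the offending
file's path, even though the job history file doesn't.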