Good idea. I guess I'll just have to run it up again, though.
Thanks

On 4 December 2011 01:12, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> Hi Mat
>         I'm not aware of a built-in mechanism in Hadoop that logs the input
> splits (file names) each mapper is processing. To get that you may have to
> do some custom logging: just log the input file name at the start of the map
> method. The full file path in HDFS can be obtained from the InputSplit as
> follows:
>
> // get the file split being processed
> FileSplit fileSplit = (FileSplit) context.getInputSplit();
> // log the full HDFS path of the file being processed
> log.debug(fileSplit.getPath());
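>
> In context it could look roughly like this (just a sketch; the class name,
> logger, and key/value types below are illustrative, not specific to your job):
>
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>
> public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
>   private static final Log log = LogFactory.getLog(SplitLoggingMapper.class);
>
>   @Override
>   protected void setup(Context context) {
>     // log the HDFS path behind this mapper's split once, at task start
>     FileSplit fileSplit = (FileSplit) context.getInputSplit();
>     log.info("processing split: " + fileSplit.getPath());
>   }
>
>   // ... map() as usual ...
> }
>
> (Doing it once in setup() gives you one log line per task rather than one per
> record; logging at the start of map() works just as well.)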
>
> This works with the new MapReduce API. With the old MapReduce API you can get
> the same information from the JobConf:
> job.get("map.input.file");
> With the old API you can put that line in your configure() method.
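>
> A minimal sketch of the old-API version (the class below is illustrative and
> elides the actual map() implementation):
>
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
>
> public class OldApiSplitLoggingMapper extends MapReduceBase {
>   private String inputFile;
>
>   @Override
>   public void configure(JobConf job) {
>     // "map.input.file" holds the HDFS path of the file for this map task
>     inputFile = job.get("map.input.file");
>   }
> }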
>
> Hope it helps!...
>
> Regards
> Bejoy.K.S
>
>
> On Sun, Dec 4, 2011 at 4:05 AM, Mat Kelcey <matthew.kel...@gmail.com> wrote:
>>
>> Hi folks,
>>
>> I have a Hadoop 0.20.2 map-only job with thousands of input tasks;
>> I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
>> so each task corresponds to a single file in HDFS.
>>
>> Most of the way into the job it hits a task that causes the input
>> format to OOM, and after 4 attempts it fails the whole job.
>> This is obviously not great, but for the purposes of my job I'd be
>> happy to just throw this input file away; it's only one of thousands
>> and I don't need exact results.
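>>
>> (One thought: since I don't need exact results, maybe I could let the job
>> tolerate a small percentage of failed map tasks via the old-API JobConf,
>> assuming I'm reading setMaxMapTaskFailuresPercent right:
>>
>> // allow up to 1% of map tasks to fail without failing the whole job
>> jobConf.setMaxMapTaskFailuresPercent(1);
>>
>> but even then I'd still want to know which file is the culprit.)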
>>
>> The trouble is I can't work out which file this task corresponds to.
>>
>> The closest I can find is that the job history file lists a STATE_STRING, e.g.
>> STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"
>>
>> but this appears _only_ for the successfully completed ones; for the failed
>> one I'm actually interested in there is nothing:
>> MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
>> TASK_ATTEMPT_ID="attempt_201112030459_0011_m_004130_0"
>> TASK_STATUS="FAILED" FINISH_TIME="1322901661261"
>> HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .
>>
>> I grepped through all the Hadoop logs and couldn't find anything that
>> relates this task to the files in its split.
>> Any ideas where this info might be recorded?
>>
>> Cheers,
>> Mat
>
>
