[ https://issues.apache.org/jira/browse/HADOOP-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675240#action_12675240 ]

Hemanth Yamijala commented on HADOOP-5286:
------------------------------------------

> From what I can see in the M/R code, the call to initTasks (done from the
> JobInitThread) is catching Throwable, logging it as an error, failing the job,
> and going to the next job. I did not see that error message in the log file.
> There is no other exception handling in between, so if an exception had been
> raised, the call would have been aborted. So, I believe the call to initTasks
> never returned.
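
In other words, the behaviour described is roughly the following pattern (a
minimal sketch written from memory, not the actual JobTracker source; the Job
interface, queue and failJob helper here are placeholders):

    // Sketch of the job-initialization loop described above; a paraphrase, not
    // the real org.apache.hadoop.mapred code. The key point: initTasks() is
    // wrapped in a catch (Throwable), so an exception would have failed the job
    // and the thread would have moved on to the next job.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class JobInitLoopSketch implements Runnable {

      interface Job { void initTasks() throws Exception; }   // stand-in for JobInProgress

      private final BlockingQueue<Job> queue = new LinkedBlockingQueue<Job>();

      public void run() {
        while (true) {
          Job job;
          try {
            job = queue.take();              // next job waiting to be initialized
          } catch (InterruptedException ie) {
            return;
          }
          try {
            job.initTasks();                 // this is the call that reads the split file from DFS
          } catch (Throwable t) {
            // the real code logs the error, fails the job, and moves on to the next one
            System.err.println("Job initialization failed: " + t);
            failJob(job);
          }
        }
      }

      private void failJob(Job job) {
        // placeholder: mark the job as failed so the scheduler skips it
      }
    }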

The reason there are multiple reads on the split file is that a single split 
file typically fits within the same block (given the type of jobs we are 
running). Hence, when M/R reads multiple fields from the split file, it appears 
to be reading that block multiple times. The really important thing here is the 
time difference between these two log lines:

2009-02-19 10:26:00,504 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2044238107768440002_840946 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-19 11:29:25,054 INFO org.apache.hadoop.mapred.JobInProgress: Input size 
for job job_200902190419_0419 = 635236206196

That is about an hour, with no M/R logs in between. So I believe the rest of 
the split file was being read slowly, but the reads ultimately succeeded, 
because the job finally began to run.
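
To make that read pattern concrete, the split file is consumed roughly along
these lines (a simplified sketch; the exact field layout and path handling in
initTasks differ, so treat the format here as a placeholder). Since the whole
file usually sits in one block, every read below goes to that same block, and
a slow datanode stalls each of them in turn:

    // Simplified sketch of reading a job's split file field by field. The
    // on-disk layout here is a placeholder, not the exact format the
    // JobTracker uses; the point is only that each read is served from the
    // same underlying DFS block.
    import java.io.DataInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitFileReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path splitFile = new Path(args[0]);     // e.g. the job's split file under the system dir

        DataInputStream in = fs.open(splitFile);
        try {
          int numSplits = in.readInt();         // first field: read from the (single) block
          for (int i = 0; i < numSplits; i++) {
            int len = in.readInt();             // next field: same block again
            byte[] split = new byte[len];
            in.readFully(split);                // and again for the split payload
          }
        } finally {
          in.close();
        }
      }
    }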

One question: if the datanode is so slow in serving the read, is there a way, 
through configuration, to make these calls time out sooner, so that DFS throws 
an exception, the job is aborted, and other jobs can proceed? We could try such 
a configuration and see if it helps.
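
For example, if the DFS client read timeout is still controlled by the
dfs.socket.timeout property in this release (an assumption on my part, to be
verified against the DFSClient code), we could try lowering it on the
JobTracker along these lines:

    // Sketch of lowering the DFS client read timeout so a stuck read fails fast
    // instead of blocking job initialization. The property name and its effect
    // are my assumption for this release and should be checked against
    // DFSClient; the value is in milliseconds.
    import org.apache.hadoop.conf.Configuration;

    public class DfsReadTimeoutSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.socket.timeout", 60 * 1000);   // give up on a stuck read after ~60 seconds
        // ... the JobTracker would then open the split file with a FileSystem built from this conf
      }
    }

The obvious risk is that a read which is merely slow but would eventually
succeed, like the one above, would now fail the job instead.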

> DFS client blocked for a long time reading blocks of a file on the JobTracker
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-5286
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5286
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0
>            Reporter: Hemanth Yamijala
>         Attachments: jt-log-for-blocked-reads.txt
>
>
> On a large cluster, we've observed that the DFS client was blocked reading a 
> block of a file for almost an hour and a half. The file was being read by the 
> JobTracker of the cluster, and was a split file of a job. In the NameNode 
> logs, we observed the following message for the block:
> Inconsistent size for block blk_2044238107768440002_840946 reported from 
> <ip>:<port> current size is 195072 reported size is 1318567
> Details follow.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.