[
https://issues.apache.org/jira/browse/HADOOP-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675240#action_12675240
]
Hemanth Yamijala commented on HADOOP-5286:
------------------------------------------
From what I can see in the M/R code, the call to initTasks (done from the
JobInitThread) is wrapped in a catch of Throwable that logs it as an error,
fails the job and moves on to the next job. I did not see that error message in
the log file. There is no other exception handling in between, so if an
exception had been raised, the call would have been aborted. I therefore
believe the call to initTasks never returned.
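For reference, that loop behaves roughly like the sketch below. This is a hypothetical
simplification with made-up names, not the actual 0.20 source; the point is only that
any Throwable escaping initTasks() would be caught here and logged, so the absence of
such a log entry suggests initTasks() was still blocked inside the DFS read.

// Minimal sketch (hypothetical names, not the actual JobTracker code) of the
// job-init thread behaviour described above.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class JobInitThreadSketch implements Runnable {
  interface Job {
    void initTasks() throws Exception; // reads the split file from DFS, builds the task list
    void kill();                       // marks the job as failed
  }

  private final BlockingQueue<Job> initQueue = new LinkedBlockingQueue<Job>();
  private volatile boolean running = true;

  public void submit(Job job) { initQueue.add(job); }

  public void run() {
    while (running) {
      Job job;
      try {
        job = initQueue.take();                 // next job waiting for initialization
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
      try {
        job.initTasks();                        // the call that appears never to have returned
      } catch (Throwable t) {
        // The only handler between the DFS read and the init thread: had the read
        // thrown, we would expect this error message in the JobTracker log.
        System.err.println("Job initialization failed: " + t);
        job.kill();
      }
    }
  }
}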
The reason there are multiple reads on the split file is that a single split
file typically fits within one block (given the type of jobs we are running).
Hence, when M/R reads the various fields of the split file, it ends up reading
the same block multiple times (a rough sketch of this read pattern follows the
log excerpt below). The really important thing here is the time difference
between these two log lines:
2009-02-19 10:26:00,504 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain
block blk_2044238107768440002_840946 from any node: java.io.IOException: No
live nodes contain current block
2009-02-19 11:29:25,054 INFO org.apache.hadoop.mapred.JobInProgress: Input size
for job job_200902190419_0419 = 635236206196
That's over an hour, with no logs in between on the M/R side. So I believe the
rest of the split file was being read slowly but ultimately succeeded, because
the job finally began to run.
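To make the "multiple reads" point concrete, the split file is consumed roughly as in
the sketch below. This is a hypothetical simplification (the field layout is not the
real job.split format); what matters is that each per-split field is a separate small
read on the same stream, so when the backing block is unhealthy every one of those
reads goes back through the DFS client's block-fetch path.

// Hypothetical simplification of the split-file read pattern: many small
// sequential reads against one stream whose data typically lives in a single
// HDFS block, so a bad block is hit over and over.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;

public class SplitFileReadSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path splitFile = new Path(args[0]);              // e.g. the job's split file path

    FSDataInputStream in = fs.open(splitFile);
    try {
      int numSplits = WritableUtils.readVInt(in);    // one small read
      for (int i = 0; i < numSplits; i++) {
        String splitClass = Text.readString(in);     // another small read
        long length = WritableUtils.readVLong(in);   // ... and another
        System.out.println(splitClass + " length=" + length);
        // All of these reads are served from the same underlying block, so a
        // slow or inconsistent block turns into many slow reads in a row.
        in.skip(length);                             // skip the serialized split body
      }
    } finally {
      in.close();
    }
  }
}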
One question: if the datanode is so slow to serve reads, is there a way to make
the calls time out sooner via configuration, so that DFS throws an exception,
the job is aborted, and other jobs can proceed? We could try such a
configuration and see if that helps.
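As a concrete starting point, assuming dfs.socket.timeout is the knob that governs the
DFS client's datanode read timeout in this version (worth double-checking against the
0.20 code), something like the following could be set on the JobTracker, either
programmatically or in hadoop-site.xml:

// Sketch only. Assumption: dfs.socket.timeout is the DFS client's socket read
// timeout in milliseconds in this version; verify the key before relying on it.
// A smaller value should make a stuck datanode read fail faster, so initTasks()
// would get an IOException, that job would be failed, and other jobs could proceed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DfsReadTimeoutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.socket.timeout", 30000);   // default is 60000 ms (assumption)
    FileSystem fs = FileSystem.get(conf);       // DFS client created with the lower timeout
    System.out.println("Using filesystem: " + fs.getUri());
  }
}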
> DFS client blocked for a long time reading blocks of a file on the JobTracker
> -----------------------------------------------------------------------------
>
> Key: HADOOP-5286
> URL: https://issues.apache.org/jira/browse/HADOOP-5286
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.20.0
> Reporter: Hemanth Yamijala
> Attachments: jt-log-for-blocked-reads.txt
>
>
> On a large cluster, we've observed that the DFS client was blocked reading a
> block of a file for almost one and a half hours. The file was being read by the
> JobTracker of the cluster and was a split file of a job. In the NameNode
> logs, we observed the following message for the block:
> Inconsistent size for block blk_2044238107768440002_840946 reported from
> <ip>:<port> current size is 195072 reported size is 1318567
> Details follow.
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.