[
https://issues.apache.org/jira/browse/MAPREDUCE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815261#comment-13815261
]
Chris Nauroth commented on MAPREDUCE-5606:
------------------------------------------
I've seen this happen in more recent 1.x versions too. In my case, it happened
while writing job history files to HDFS. The problem is that this occurs while
holding a global lock (inside a synchronized method of the {{JobTracker}}
object). This prevents the JT from getting other useful work done, like
accepting new job submissions or displaying the web UI. You might be able to
confirm this by inspecting a thread dump of your JT process while this is
happening.
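For example, a thread dump taken with the JDK's {{jstack}} tool while the hang is in progress should show one thread stuck in {{DFSClient}} inside a synchronized {{JobTracker}} method, with many other threads waiting to lock the same object. A rough sketch (substitute the real JobTracker pid, e.g. from {{jps}}):
{code}
# Replace <jt-pid> with the JobTracker's process id.
jstack <jt-pid> > jt-threads.txt
# A thread blocked in DFSClient while holding the JobTracker monitor,
# plus other threads "waiting to lock" the same object, confirms the diagnosis.
grep -B 2 -A 20 'DFSClient' jt-threads.txt
{code}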
If your investigation shows the same root cause (the JobTracker blocked while
writing history files to HDFS), then you can disable HDFS history writing and
keep history on the local file system only. If the configuration parameter
{{hadoop.job.history.location}} is set to a location on HDFS, remove it. (It
will then default to the standard Hadoop log directory on the local file
system.)
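Alternatively, instead of removing the property, you could point it at a local path explicitly. A hypothetical {{mapred-site.xml}} fragment (the path shown is just an example, not a required location):
{code}
<!-- Keep job history on the local file system instead of HDFS. -->
<property>
  <name>hadoop.job.history.location</name>
  <value>file:///var/log/hadoop/history</value>
</property>
{code}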
There is also {{hadoop.job.history.user.location}}. If unspecified, this
defaults to writing per-job history files into each job's output directory in
HDFS. You can disable these files by setting the value to {{none}}, like this:
{code}
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
  <final>true</final>
</property>
{code}
To fix this issue completely, we'd need to move the history-writing logic
outside of the {{JobTracker}} monitor. Really, any kind of I/O performed while
holding a global lock is problematic, because a slow or failed I/O call stalls
everything else that needs the lock.
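One common way to restructure this (a minimal sketch, not the actual JobTracker code; the class and method names here are hypothetical) is to have the synchronized method only enqueue a history event, and let a dedicated background thread perform the slow I/O with no global lock held:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: decouple history writing from a global monitor. The locked
// section only enqueues an event; a writer thread drains the queue and
// performs the (possibly slow or hanging) I/O outside the lock.
public class AsyncHistoryWriter {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final List<String> written = new ArrayList<>(); // stand-in for HDFS
    private final Thread writer;
    private volatile boolean running = true;

    public AsyncHistoryWriter() {
        writer = new Thread(() -> {
            try {
                while (running || !events.isEmpty()) {
                    String e = events.poll(100, TimeUnit.MILLISECONDS);
                    if (e != null) {
                        // Slow I/O happens here, with no global lock held.
                        synchronized (written) { written.add(e); }
                    }
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
    }

    // Analogue of a synchronized JobTracker method: holds the monitor
    // only long enough to enqueue, never for the I/O itself.
    public synchronized void recordJobEvent(String event) {
        events.add(event);
    }

    // Stop accepting work and wait for queued events to be flushed.
    public void shutdown() {
        running = false;
        try { writer.join(); } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    public List<String> writtenEvents() {
        synchronized (written) { return new ArrayList<>(written); }
    }

    public static void main(String[] args) {
        AsyncHistoryWriter w = new AsyncHistoryWriter();
        w.recordJobEvent("job_1 SUBMITTED");
        w.recordJobEvent("job_1 FINISHED");
        w.shutdown();
        System.out.println(w.writtenEvents());
    }
}
```

With this shape, even if a datanode hangs mid-write, only the writer thread blocks; the monitor stays available for job submissions and the web UI.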
> JobTracker blocked for DFSClient: Failed recovery attempt
> ---------------------------------------------------------
>
> Key: MAPREDUCE-5606
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5606
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 1.0.3
> Environment: centos 5.8 jdk 1.7
> Reporter: firegun
> Assignee: firegun
> Priority: Critical
>
> When a datanode crashed, the server still responded to ping, but RPC calls
> failed and SSH logins were impossible. The JobTracker may then have requested
> a block on this datanode.
> When this happens, the JobTracker stops working: the web UI is unreachable,
> "hadoop job -list" also stops working, and the JobTracker logs show no
> further information.
> We then need to restart the datanode.
> After that, the JobTracker works again, but the TaskTracker count drops to
> zero, and we need to run: hadoop mradmin -refreshNodes
> The JobTracker then begins re-adding TaskTrackers, but very slowly.
> This problem occurred 5 times in 2 weeks.
--
This message was sent by Atlassian JIRA
(v6.1#6144)