[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815261#comment-13815261
 ] 

Chris Nauroth commented on MAPREDUCE-5606:
------------------------------------------

I've seen this happen in more recent 1.x versions too.  In my case, it happened 
while writing job history files to HDFS.  The problem is that this occurs while 
holding a global lock (inside a synchronized method of the {{JobTracker}} 
object).  This prevents the JT from getting other useful work done, like 
accepting new job submissions or serving the web UI.  You might be able to 
confirm this by inspecting a thread dump of your JT process (e.g., captured 
with jstack) while the hang is in progress.

If your investigation shows the same root cause (threads blocked writing history 
files to HDFS), then you can work around it by writing history only to the local 
file system.  If the configuration parameter hadoop.job.history.location is set 
to a location on HDFS, then remove that setting.  (It will default to the 
standard Hadoop log directory on the local file system.)
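
For reference, the kind of entry to look for and remove in mapred-site.xml 
would be something like the following.  The HDFS URI below is a made-up 
example, not a value from any particular cluster:

{code}
<!-- Hypothetical example of the property to delete.  With
     hadoop.job.history.location unset, history files go to the standard
     Hadoop log directory on the local file system. -->
<property>
  <name>hadoop.job.history.location</name>
  <value>hdfs://namenode:8020/mapred/history</value>
</property>
{code}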

There is also hadoop.job.history.user.location.  If unspecified, it defaults to 
writing per-job history files into each job's output directory in HDFS.  You can 
disable these files by setting the value to {{none}}, like this:

{code}
<property>
  <name>hadoop.job.history.user.location</name>
  <!-- The special value "none" disables per-job history files. -->
  <value>none</value>
  <!-- final prevents job configurations from overriding this setting. -->
  <final>true</final>
</property>
{code}
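
Note that this would go in the JobTracker's mapred-site.xml, and restarting the 
JT is the safe way to make sure the change is picked up.  Marking the value 
{{final}} should keep individual job configurations from overriding it.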

To fix this issue completely, we'd need to move the logic for writing history 
outside of the {{JobTracker}} monitor.  Really, any kind of I/O performed while 
holding a global lock is problematic, because a single slow or failed operation 
stalls every other thread waiting on that lock.

> JobTracker blocked for DFSClient: Failed recovery attempt
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5606
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5606
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1.0.3
>         Environment: CentOS 5.8, JDK 1.7
>            Reporter: firegun
>            Assignee: firegun
>            Priority: Critical
>
> When a datanode crashed, the server still responded to ping, but RPC calls 
> failed and SSH logins were also impossible.  The JobTracker may then request 
> a block on this datanode.
> When that happens, the JobTracker stops working: the web UI is unresponsive, 
> hadoop job -list hangs, and the JobTracker logs show no further information.
> We then need to restart the datanode.  After that the JobTracker works again, 
> but the TaskTracker count drops to zero, and we have to run:
> hadoop mradmin -refreshNodes
> The JobTracker then begins to re-add TaskTrackers, but very slowly.
> This problem has occurred 5 times in 2 weeks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
