[ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HADOOP-2393:
--------------------------------------

    Summary: TaskTracker locks up removing job files within a synchronized method  (was: healthy tasks dying because unable to ping tasktracker)

After the latest burst of problems, I am reasonably sure that the underlying cause of these timeouts is the expensive synchronized methods in the TaskTracker:

  private synchronized void purgeJob(KillJobAction action) throws IOException {
    ...
    fConf.deleteLocalFiles(SUBDIR + Path.SEPARATOR + JOBCACHE +
                           Path.SEPARATOR + rjob.getJobId());
    ...
  }

  public synchronized boolean ping(String taskid) throws IOException {
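
Both methods take the TaskTracker's intrinsic lock, so while purgeJob() is deleting a large job directory no ping() call can be serviced. A minimal, self-contained sketch of that kind of monitor contention (a hypothetical demo class, not the actual TaskTracker code):

  // Hypothetical demo: both methods share one object monitor, so a slow
  // "purge" blocks every "ping" for its full duration.
  public class MonitorContentionDemo {

    // stand-in for purgeJob(): holds the monitor while doing slow file I/O
    private synchronized void purgeJob() throws InterruptedException {
      Thread.sleep(10000);   // pretend this is deleteLocalFiles() on a big job dir
    }

    // stand-in for ping(): trivial, but it needs the same monitor
    private synchronized boolean ping(String taskId) {
      return true;
    }

    public static void main(String[] args) throws Exception {
      final MonitorContentionDemo tracker = new MonitorContentionDemo();

      Thread purger = new Thread(new Runnable() {
        public void run() {
          try {
            tracker.purgeJob();
          } catch (InterruptedException ignored) {
          }
        }
      });
      purger.start();
      Thread.sleep(100);     // let purgeJob() grab the monitor first

      long start = System.currentTimeMillis();
      tracker.ping("task_0001_m_000000_0");   // blocks until purgeJob() returns
      System.out.println("ping blocked for " +
          (System.currentTimeMillis() - start) + " ms");
    }
  }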

Every time the TaskTracker gets a KillJobAction, it is unpingable for a fair bit of time and all the children error out. I have plenty of logs to bear this out. For example, note the silent gap in the log and the immediate child errors right after the KillJobAction:

2008-01-25 11:44:24,399 INFO org.apache.hadoop.mapred.TaskTracker: task_200801140959_4200_m_000001_0 0.88915026% hdfs://hadoop001.sf2p.facebook.com:9000/user/facebook/warehouse/ad_imps_annotated/ds=2008-01-06/part-00000:67108864+67108864
2008-01-25 11:44:24,421 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_200801140959_4161
2008-01-25 11:44:24,422 INFO org.apache.hadoop.mapred.TaskRunner: task_200801140959_4161_r_000023_1 done; removing files.
2008-01-25 11:44:26,531 INFO org.apache.hadoop.mapred.TaskRunner: task_200801140959_4161_m_000107_0 done; removing files.
2008-01-25 11:44:41,252 INFO org.apache.hadoop.mapred.TaskRunner: task_200801140959_4161_m_000164_0 done; removing files.
2008-01-25 11:44:53,044 INFO org.apache.hadoop.mapred.TaskRunner: task_200801140959_4161_r_000005_0 done; removing files.
2008-01-25 11:44:53,045 INFO org.apache.hadoop.mapred.TaskRunner: task_200801140959_4161_m_000027_0 done; removing files.
2008-01-25 11:44:55,642 WARN org.apache.hadoop.mapred.TaskRunner: task_200801140959_4200_r_000013_0 Child Error
java.io.IOException: Task process exit with nonzero status of 65.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:402)


I think for now the workaround is simple: I will just increase the ping retry count. But this seems pretty bad. We should not be doing expensive file I/O in a synchronized method; we need to queue it up and do it in a different thread. I checked trunk and the code looks pretty much the same.
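
A sketch of that queue-it-up approach (hypothetical names, not a patch against trunk): purgeJob() only enqueues the job directory path, and a single daemon thread does the deletion outside the TaskTracker monitor, so ping() is never stuck behind file I/O.

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  // Hypothetical helper: deletions are queued and performed by one daemon
  // thread, so no caller holds a monitor across expensive file I/O.
  public class CleanupQueue {

    private final BlockingQueue<String> paths = new LinkedBlockingQueue<String>();

    public CleanupQueue() {
      Thread cleaner = new Thread(new Runnable() {
        public void run() {
          while (true) {
            try {
              String path = paths.take();   // blocks until work arrives
              deleteRecursively(path);      // slow I/O happens here, outside any TaskTracker lock
            } catch (InterruptedException e) {
              return;
            }
          }
        }
      }, "taskTracker-cleanup");
      cleaner.setDaemon(true);
      cleaner.start();
    }

    // called from purgeJob(); returns immediately
    public void addToQueue(String path) {
      paths.add(path);
    }

    private void deleteRecursively(String path) {
      // stand-in for fConf.deleteLocalFiles(path)
    }
  }

purgeJob() would then replace the deleteLocalFiles() call with something like cleanupQueue.addToQueue(SUBDIR + Path.SEPARATOR + JOBCACHE + Path.SEPARATOR + rjob.getJobId()), keeping the synchronized section cheap.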

> TaskTracker locks up removing job files within a synchronized method 
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-core x86-64, FC Linux, -Xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> We have some bad jobs where the reduces are getting stalled (for unknown reasons). The TaskTracker kills these processes from time to time.
> Every time one of these events happens, other (healthy) map tasks on the same node are also killed. Looking at the logs and the code up to 0.14.3, it seems that the child tasks' pings to the TaskTracker time out and the child tasks self-terminate.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads
> ... huge stack trace dump in logfile ...
> Something was going on at this time which caused the TaskTracker to essentially stall. All the pings were discarded. After the stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941: discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183: discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941: discarded for being too old (10367)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183: discarded for being too old (10360)
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error
> Looking at the code, a failure of the client to ping causes termination:
>               else {
>                 // send ping
>                 taskFound = umbilical.ping(taskId);
>               }
> ...
>             catch (Throwable t) {
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));
>               remainingRetries -= 1;
>               if (remainingRetries == 0) {
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
>                 LOG.warn("Last retry, killing " + taskId);
>                 System.exit(65);
> The exit code is 65, as reported by the TaskTracker.
> I don't see an option to turn off the stack trace dump (which could be a likely cause), and I would hate to bump up the timeout because of this. Crap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
