[jira] Issue Comment Edited: (HADOOP-2393) TaskTracker locks up removing job files within a synchronized method

Devaraj Das (JIRA) Wed, 23 Apr 2008 10:35:09 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591691#action_12591691
 ]


devaraj edited comment on HADOOP-2393 at 4/23/08 10:30 AM:
---------------------------------------------------------------

How about doing this: create a daemon thread that just deletes the paths 
queued up for it. Any call to deleteLocalFiles (this is called from two 
places in the TaskTracker) then essentially just inserts the path to be 
deleted into this queue. This keeps changes to the flow of execution minimal 
and keeps the fix simple.
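
A minimal sketch of the queue-plus-daemon-thread idea, not the actual patch. The class name AsyncPathDeleter and its methods are hypothetical and are not existing Hadoop APIs; it only assumes that the caller (for example deleteLocalFiles) hands over a java.io.File instead of deleting it inline while holding the TaskTracker lock.

```java
import java.io.File;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper (not part of Hadoop): deletes queued paths on a daemon thread.
class AsyncPathDeleter {
    private final BlockingQueue<File> queue = new LinkedBlockingQueue<File>();
    private final Thread worker;

    AsyncPathDeleter() {
        worker = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        File path = queue.take(); // blocks until a path is queued
                        try {
                            deleteRecursively(path);
                        } catch (RuntimeException e) {
                            // Keep the worker alive if a single delete fails.
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit quietly on interrupt
                }
            }
        }, "async-path-deleter");
        worker.setDaemon(true); // must not keep the JVM alive at shutdown
        worker.start();
    }

    // Called in place of a synchronous delete; returns immediately.
    void enqueue(File path) {
        queue.add(path);
    }

    // Deletes a file, or a directory and everything under it.
    private static boolean deleteRecursively(File f) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        return f.delete();
    }
}
```

With something like this, the synchronized caller only pays for an enqueue, so a slow disk delete no longer holds the lock while IPC handlers are waiting to answer pings.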

      was (Author: devaraj):
    How about doing this: create a daemon thread that will just do deletes for 
paths queued up in its queue. So any call to deleteLocalFiles essentially just 
inserts the path to be deleted in this queue. This will ensure minimal changes 
in the flow of execution and keep it simple.
  
> TaskTracker locks up removing job files within a synchronized method 
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-core x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> We have some bad jobs where the reduces are getting stalled (for unknown 
> reasons). The task tracker kills these processes from time to time.
> Every time one of these events happens, other (healthy) map tasks on the same 
> node are also killed. Looking at the logs and code up to 0.14.3, it seems 
> that the child tasks' pings to the task tracker time out and the child 
> task self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads
> ... huge stack trace dump in logfile ...
> Something was going on at this time which caused the tasktracker to 
> essentially stall. All the pings are discarded. After the stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941: discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183: discarded for being too old (21380)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941: discarded for being too old (10367)
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183: discarded for being too old (10360)
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error
> looking at code, failure of client to ping causes termination:
>               else {
>                 // send ping
>                 taskFound = umbilical.ping(taskId);
>               }
> ...
>             catch (Throwable t) {
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));
>               remainingRetries -= 1;
>               if (remainingRetries == 0) {
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
>                 LOG.warn("Last retry, killing " + taskId);
>                 System.exit(65);
> exit code is 65 as reported by task tracker.
> i don't see an option to turn off stack trace dump (which could be a likely 
> cause) - and i would hate to bump up timeout because of this. Crap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
