[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mitic updated MAPREDUCE-5512:
----------------------------------

    Attachment: MAPREDUCE-5512.branch-1.patch

Attaching the patch.

My proposal for the fix is to make the distributed cache cleanup thread a
daemon. Based on a scan through the code, I think this change should be safe.
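
To illustrate the idea (this is a minimal standalone sketch, not the code from
the patch): a thread marked as a daemon before it is started does not keep the
JVM alive once all non-daemon threads have finished. The class name, thread
name, and helper method below are hypothetical and do not come from the Hadoop
codebase.

```java
public class DaemonCleanupSketch {

    /**
     * Hypothetical stand-in for creating the cleanup thread.
     * setDaemon(true) must be called before start(); calling it on a
     * running thread throws IllegalThreadStateException.
     */
    static Thread newCleanupThread(Runnable work) {
        Thread t = new Thread(work, "hypothetical-cache-cleanup");
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        Thread cleanup = newCleanupThread(new Runnable() {
            public void run() {
                try {
                    // Simulates a periodic cleanup loop that never ends on its own.
                    while (true) {
                        Thread.sleep(1000);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        cleanup.start();
        System.out.println("cleanup thread is daemon: " + cleanup.isDaemon());
        // main() returns here. Because the only thread still running is a
        // daemon, the JVM is free to exit instead of hanging.
    }
}
```

If the setDaemon(true) call is removed, the same program keeps running after
main() returns, which is the same kind of hang described in this issue.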

For the unit test, I added a test that validates the list of non-daemon
threads. This is a more general test case, but I think it will serve well to
protect the codebase against regressions in this area. I was not able to come
up with a clean way to simulate the condition from this bug without adding a
test hook in the production code, so I moved away from that approach. (We would
have to start the JT, stop it, and start it again, which would tell the TT to
reinit, and then stop the JT once more; that last stop must be timed to run
before TT#initialize() executes.)
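
For readers who want a feel for what such a check looks like, here is a rough
sketch, assuming a simple name-based whitelist. It is not the test from the
patch; the class and method names are hypothetical. It collects the names of
all live non-daemon threads and reports any that are not in an allowed set.

```java
import java.util.Set;
import java.util.TreeSet;

public class NonDaemonThreadCheck {

    /** Returns the names of all threads that are alive and not daemons. */
    static Set<String> liveNonDaemonThreadNames() {
        Set<String> names = new TreeSet<String>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /** Returns live non-daemon thread names that are not in the allowed set. */
    static Set<String> unexpected(Set<String> allowed) {
        Set<String> result = liveNonDaemonThreadNames();
        result.removeAll(allowed);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical whitelist; a real test would list the threads the
        // component under test is expected to leave running.
        Set<String> allowed = new TreeSet<String>();
        allowed.add(Thread.currentThread().getName());

        Set<String> leaked = unexpected(allowed);
        System.out.println("unexpected non-daemon threads: " + leaked);
    }
}
```

A test built this way fails whenever a new non-daemon thread shows up that is
not on the whitelist, which is why it catches more than the one thread fixed
here.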

On a related note, looking at the list of threads I had to whitelist, there
might be some other threads that are candidates for becoming daemons, but I'd
prefer not to make that change in the context of this Jira.

> TaskTracker hung after failed reconnect to the JobTracker
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5512
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5512
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: hadoop-tasktracker-RD00155DD09100.log, 
> MAPREDUCE-5512.branch-1.patch, tt_Hung.txt
>
>
> TaskTracker hung after failed reconnect to the JobTracker. 
> This is the problematic piece of code:
> {code}
>     this.distributedCacheManager = new TrackerDistributedCacheManager(
>         this.fConf, taskController);
>     this.distributedCacheManager.startCleanupThread();
>     
>     this.jobClient = (InterTrackerProtocol) 
>     UserGroupInformation.getLoginUser().doAs(
>         new PrivilegedExceptionAction<Object>() {
>       public Object run() throws IOException {
>         return RPC.waitForProxy(InterTrackerProtocol.class,
>             InterTrackerProtocol.versionID,
>             jobTrackAddr, fConf);
>       }
>     });
> {code}
> If RPC.waitForProxy() throws, the TrackerDistributedCacheManager cleanup 
> thread is never stopped, and because it is a non-daemon thread it keeps 
> the TT process alive forever.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
