[ https://issues.apache.org/jira/browse/MAPREDUCE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796092#action_12796092 ]
Amar Kamat commented on MAPREDUCE-1342:
---------------------------------------

Simply making _potentiallyFaultyTrackers_ a ConcurrentHashMap and removing the *synchronized* keyword might introduce more issues. I think the reason for synchronizing on _potentiallyFaultyTrackers_ was to perform some operations atomically. Have you checked whether the semantics remain the same after removing the synchronized keyword? I think making _potentiallyFaultyTrackers_ a ConcurrentHashMap is better but might be dangerous.

Another way to avoid the deadlock would be to mark a few non-private APIs in JobTracker.FaultyTrackerInfo as synchronized. Mainly:
{code}
JobTracker.FaultyTrackerInfo.incrementFaults            // called via Heartbeat and testcases
JobTracker.FaultyTrackerInfo.markTrackerHealthy         // called via Heartbeat
JobTracker.FaultyTrackerInfo.shouldAssignTasksToTracker // called via Heartbeat and testcases
JobTracker.FaultyTrackerInfo.isBlacklisted              // called in multiple cases .. need to check
JobTracker.FaultyTrackerInfo.getFaultCount              // called via Heartbeat and testcases
JobTracker.FaultyTrackerInfo.getReasonForBlackListing   // never used!
JobTracker.FaultyTrackerInfo.setNodeHealthStatus        // called via Heartbeat and testcases
{code}
So, except for JobTracker.FaultyTrackerInfo.isBlacklisted(), all of these calls are made while the JobTracker lock is already held. Hence adding the synchronized keyword to those method signatures wouldn't introduce any overhead. I still need to check the callers of JobTracker.FaultyTrackerInfo.isBlacklisted().
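
To illustrate the atomicity concern above, here is a minimal sketch, not the JobTracker code. The class names, the threshold, and the method bodies are hypothetical. It contrasts synchronized methods over a plain HashMap with a ConcurrentHashMap whose read-modify-write is done in a single atomic call. It assumes Java 8 or later for merge() and getOrDefault().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FaultCountSketch {
    // Hypothetical threshold, chosen only for this illustration.
    static final int BLACKLIST_THRESHOLD = 4;

    /** Option A: plain HashMap, every non-private method synchronized on this object. */
    static class SynchronizedFaults {
        private final Map<String, Integer> faults = new HashMap<>();

        synchronized int incrementFaults(String host) {
            // The read and the write happen under one lock, so no update is lost.
            int updated = faults.getOrDefault(host, 0) + 1;
            faults.put(host, updated);
            return updated;
        }

        synchronized boolean isBlacklisted(String host) {
            return faults.getOrDefault(host, 0) >= BLACKLIST_THRESHOLD;
        }
    }

    /** Option B: ConcurrentHashMap with no external lock. */
    static class ConcurrentFaults {
        private final ConcurrentHashMap<String, Integer> faults = new ConcurrentHashMap<>();

        int incrementFaults(String host) {
            // merge() performs the read-modify-write atomically for this key.
            // A separate get() followed by put() could lose concurrent updates.
            return faults.merge(host, 1, Integer::sum);
        }

        boolean isBlacklisted(String host) {
            return faults.getOrDefault(host, 0) >= BLACKLIST_THRESHOLD;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentFaults faults = new ConcurrentFaults();
        Runnable reporter = () -> {
            for (int i = 0; i < 1000; i++) {
                faults.incrementFaults("tracker-host");
            }
        };
        Thread first = new Thread(reporter);
        Thread second = new Thread(reporter);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("blacklisted: " + faults.isBlacklisted("tracker-host"));
    }
}
```

Either option keeps a single counter update atomic. What Option B does not give is atomicity across several map operations or across the map and other JobTracker state, which is the part that would need checking before dropping the synchronized blocks.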

> Potential JT deadlock in faulty TT tracking
> -------------------------------------------
>
>                 Key: MAPREDUCE-1342
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1342
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>         Attachments: cycle0.png, mapreduce-1342-1.patch
>
>
> JT$FaultyTrackersInfo.incrementFaults first locks potentiallyFaultyTrackers,
> and then calls blackListTracker, which calls removeHostCapacity, which locks
> JT.taskTrackers
> On the other hand, JT.blacklistedTaskTrackers() locks taskTrackers, then
> calls faultyTrackers.isBlacklisted() which goes on to lock
> potentiallyFaultyTrackers.
> I haven't produced such a deadlock, but the lock ordering here is inverted
> and therefore could deadlock.
> Not sure if this goes back to 0.21 or just in trunk.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.