[ https://issues.apache.org/jira/browse/HDFS-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439874#comment-13439874 ]

Robert Joseph Evans commented on HDFS-3843:
-------------------------------------------

MAPREDUCE-2494 added a new lock when releasing a dist cache entry, which 
introduced this problem.  Thanks to Koji for finding and debugging this.

Essentially the heartbeat thread takes a lock on the TaskTracker object.
So does the job cleanup thread, which additionally takes the 
TrackerDistributedCacheManager's big list lock (this is the lock that 
MAPREDUCE-2494 added).
The thread that deletes things from the dist cache also grabs that big lock, 
and while holding it grabs the lock for each entry in the dist cache in turn.
While an entry in the dist cache is being downloaded, the downloading thread 
holds that entry's lock.

So this can result in the following chain: the downloading thread holds a dist 
cache entry lock, which blocks the dist cache delete thread; the delete thread 
holds the full dist cache map lock, which blocks the job cleanup thread; and 
the cleanup thread holds the TaskTracker lock, which blocks the heartbeat 
thread.  This can be seen in the stack traces below.

I think it is probably best to change the dist cache entries' locks so that 
when we go to delete an entry, we skip it if its lock is already held instead 
of blocking on it.
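The skip-if-busy idea above could be sketched roughly as follows. This is a hypothetical illustration, not the real TrackerDistributedCacheManager code: the `CacheEntry` class and `checkAndCleanup` method are made-up names, and it assumes the per-entry monitor is replaced with a `ReentrantLock` so the cleanup pass can use a non-blocking `tryLock()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class CleanupSketch {
    // Hypothetical stand-in for a dist cache entry and its lock.
    static class CacheEntry {
        final ReentrantLock lock = new ReentrantLock();
    }

    // Delete every entry whose lock we can take immediately; a busy
    // entry (e.g. one still being downloaded) is skipped rather than
    // blocked on, and will be reconsidered on the next cleanup pass.
    static List<String> checkAndCleanup(Map<String, CacheEntry> cache) {
        List<String> deleted = new ArrayList<>();
        for (Map.Entry<String, CacheEntry> e : cache.entrySet()) {
            if (e.getValue().lock.tryLock()) {        // non-blocking
                try {
                    deleted.add(e.getKey());          // delete localized files here
                } finally {
                    e.getValue().lock.unlock();
                }
            }                                         // else: skip busy entry
        }
        for (String key : deleted) {
            cache.remove(key);
        }
        return deleted;
    }

    public static void main(String[] args) throws Exception {
        Map<String, CacheEntry> cache = new ConcurrentHashMap<>();
        cache.put("/cache/a.jar", new CacheEntry());
        cache.put("/cache/b.jar", new CacheEntry());

        // Simulate a downloading thread that holds b.jar's entry lock.
        CacheEntry busy = cache.get("/cache/b.jar");
        Thread downloader = new Thread(() -> {
            busy.lock.lock();
            try {
                Thread.sleep(300);                    // "downloading"
            } catch (InterruptedException ignored) {
            } finally {
                busy.lock.unlock();
            }
        });
        downloader.start();
        while (!busy.lock.isLocked()) {               // wait until the lock is held
            Thread.yield();
        }

        List<String> deleted = checkAndCleanup(cache);
        System.out.println("deleted: " + deleted);    // only a.jar; b.jar is skipped
        downloader.join();
    }
}
```

With this shape the cleanup thread never waits on an entry lock, so the cycle through the downloading thread in the traces below cannot form.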

{noformat}
Here, tracing from the heartbeat thread.
1=================================================
"main" prio=10 tid=0x0875c400 nid=0x3fca waiting for monitor entry [0xf73e6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1790)
  - waiting to lock <0xb4299248> (a org.apache.hadoop.mapred.TaskTracker)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1653)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2503)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3744)

Looking for lock <0xb4299248> 
2=================================================
"taskCleanup" daemon prio=10 tid=0x0949ac00 nid=0x405c waiting for monitor entry [0xadead000]
   java.lang.Thread.State: BLOCKED (on object monitor)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CacheStatus.decRefCount(TrackerDistributedCacheManager.java:597)
  - waiting to lock <0xb4214308> (a java.util.LinkedHashMap)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager.releaseCache(TrackerDistributedCacheManager.java:233)
  at org.apache.hadoop.filecache.TaskDistributedCacheManager.release(TaskDistributedCacheManager.java:254)
  at org.apache.hadoop.mapred.TaskTracker.purgeJob(TaskTracker.java:2066)
  - locked <0xb51e5d78> (a org.apache.hadoop.mapred.TaskTracker$RunningJob)
  - locked <0xb4299248> (a org.apache.hadoop.mapred.TaskTracker)
  at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:439)
  at java.lang.Thread.run(Thread.java:619)


Looking for the lock <0xb4214308>


3=================================================
"Thread-27" prio=10 tid=0xae501400 nid=0x4021 waiting for monitor entry [0xae4ad000]
   java.lang.Thread.State: BLOCKED (on object monitor)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager$BaseDirManager.checkAndCleanup(TrackerDistributedCacheManager.java:1019)
  - waiting to lock <0xb52776c0> (a org.apache.hadoop.filecache.TrackerDistributedCacheManager$CacheStatus)
  - locked <0xb4214308> (a java.util.LinkedHashMap)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:948)

Looking for the lock <0xb52776c0>

4=================================================
"Thread-187419" daemon prio=10 tid=0xaa103400 nid=0x3758 runnable [0xad75c000]
   java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
  at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
  - locked <0xb52998d0> (a sun.nio.ch.Util$1)
  - locked <0xb52998e0> (a java.util.Collections$UnmodifiableSet)
  - locked <0xb5299880> (a sun.nio.ch.EPollSelectorImpl)
  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
  at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
  - locked <0xb5505ec8> (a java.io.BufferedInputStream)
  at java.io.DataInputStream.read(DataInputStream.java:132)
  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:153)
  at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1598)
  - locked <0xb5505ef8> (a org.apache.hadoop.hdfs.DFSClient$BlockReader)
  at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
  at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
  at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
  - locked <0xb5505ef8> (a org.apache.hadoop.hdfs.DFSClient$BlockReader)
  at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
  - locked <0xb5505ef8> (a org.apache.hadoop.hdfs.DFSClient$BlockReader)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
  - locked <0xb5505fd8> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
  - locked <0xb5505fd8> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
  at java.io.DataInputStream.read(DataInputStream.java:83)
  at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
  at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
  at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:220)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:220)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:220)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:220)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
  at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1222)
  at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1203)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:416)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:464)
  at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
  - locked <0xb52776c0> (a org.apache.hadoop.filecache.TrackerDistributedCacheManager$CacheStatus)
  at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
  at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1212)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
  at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1203)
  at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1118)
  at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2430)
  at java.lang.Thread.run(Thread.java:619)
{noformat}
                
> Large dist cache can block tasktracker heartbeat
> ------------------------------------------------
>
>                 Key: HDFS-3843
>                 URL: https://issues.apache.org/jira/browse/HDFS-3843
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.205.0, 1.0.0
>            Reporter: Robert Joseph Evans
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
