[jira] [Updated] (HDFS-13051) Fix dead lock during async editlog rolling if edit queue is full
[ https://issues.apache.org/jira/browse/HDFS-13051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Chen updated HDFS-13051:
-----------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.8.6
                   3.1.2
                   3.0.4
                   2.9.2
                   3.2.0
                   2.10.0
           Status: Resolved  (was: Patch Available)

Committed to trunk all the way down to branch-2.8. There were some minor fixes I had to apply at commit time (branch-3 for the log4j change, branch-2 to declare a few vars final). Compiled and ran the test on each branch where such a change was applied. Thanks [~photogamrun] for reporting the issue and providing a patch, [~daryn] for the fix, [~jojochuang] for the review, and all others for commenting.

> Fix dead lock during async editlog rolling if edit queue is full
> ----------------------------------------------------------------
>
>                 Key: HDFS-13051
>                 URL: https://issues.apache.org/jira/browse/HDFS-13051
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.5
>            Reporter: zhangwei
>            Assignee: Daryn Sharp
>            Priority: Major
>              Labels: AsyncEditlog, deadlock
>             Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2, 2.8.6
>
>         Attachments: HDFS-13112.patch, deadlock.patch
>
>
> When rolling the edit log, the NameNode acquires the FS write lock and then the FSEditLogAsync monitor, and writes three edits (the second of which overrides the logEdit method and returns true).
> In the extreme case where FSEditLogAsync's logSync is very slow and editPendingQ (default size 4096) fills up, the IPC thread cannot offer its edit object into editPendingQ while rolling the edit log. It blocks in editPendingQ.put without releasing the FSEditLogAsync monitor, so edit.logEdit in the FSEditLogAsync.run thread can never acquire that monitor: a deadlock.
> The stack trace looks like this:
>
> "Thread[Thread-44528,5,main]" #130093 daemon prio=5 os_prio=0 tid=0x02377000 nid=0x13fda waiting on condition [0x7fb3297de000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7fbd3cb96f58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.enqueueEdit(FSEditLogAsync.java:156)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logEdit(FSEditLogAsync.java:118)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logCancelDelegationToken(FSEditLog.java:1008)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logExpireDelegationToken(FSNamesystem.java:7635)
>         at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logExpireToken(DelegationTokenSecretManager.java:395)
>         - locked <0x7fbd3cbae500> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logExpireToken(DelegationTokenSecretManager.java:62)
>         at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.removeExpiredToken(AbstractDelegationTokenSecretManager.java:604)
>         at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.access$400(AbstractDelegationTokenSecretManager.java:54)
>         at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:656)
>         at java.lang.Thread.run(Thread.java:745)
>
> "FSEditLogAsync" #130072 daemon prio=5 os_prio=0 tid=0x0715b800 nid=0x13fbf waiting for monitor entry [0x7fb32c51a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.doEditTransaction(FSEditLog.java:443)
>         - waiting to lock <*0x7fbcbc131000*> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$Edit.logEdit(FSEditLogAsync.java:233)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:177)
>         at java.lang.Thread.run(Thread.java:745)
>
> "IPC Server handler 47 on 53310" #337 daemon prio=5 os_prio=0 tid=0x7fe659d46000 nid=0x4c62 waiting on condition [0x7fb32fe52000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7fbd3cb96f58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.enqueueEdit(FSEditLogAsync.java:156)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logEdit(FSEditLogAsync.java:118)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1251)
>         - locked <*0x7fbcbc131000*> (a


[jira] [Updated] (HDFS-13051) Fix dead lock during async editlog rolling if edit queue is full

[ https://issues.apache.org/jira/browse/HDFS-13051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Chen updated HDFS-13051:
-----------------------------
    Summary: Fix dead lock during async editlog rolling if edit queue is full  (was: dead lock occurs when rolleditlog rpc call happen and editPendingQ is full)
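
The cycle in the traces above reduces to a producer calling a blocking put() on a bounded queue while holding a monitor that the queue's consumer needs to make progress. A minimal Java sketch of that shape (the class and method names here are illustrative stand-ins, not Hadoop's real code):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Minimal sketch of the hazard behind HDFS-13051; MiniAsyncLog is a
// hypothetical stand-in, not the actual FSEditLogAsync implementation.
public class MiniAsyncLog {
    // Capacity 1 so the queue fills immediately (HDFS uses 4096 by default).
    private final ArrayBlockingQueue<String> editPendingQ =
            new ArrayBlockingQueue<>(1);

    // Deadlock-prone shape: a blocking put() inside a synchronized method.
    // When the queue is full, the caller parks while still owning the
    // monitor on 'this' -- and a consumer thread that needs that same
    // monitor to drain the queue can never run.
    synchronized void unsafeLogEdit(String edit) throws InterruptedException {
        editPendingQ.put(edit); // parks forever once the queue is full
    }

    // Safer shape: never wait indefinitely while holding the monitor.
    // offer() fails fast when the queue is full, so the caller can release
    // the lock and back off. (A sketch only -- the actual HDFS-13051 patch
    // handles the overflow case differently.)
    synchronized boolean safeLogEdit(String edit) {
        return editPendingQ.offer(edit);
    }

    public static void main(String[] args) {
        MiniAsyncLog log = new MiniAsyncLog();
        System.out.println(log.safeLogEdit("OP_END_LOG_SEGMENT")); // true
        System.out.println(log.safeLogEdit("OP_CANCEL_TOKEN"));    // false: queue full
        // Calling unsafeLogEdit() for the second edit instead would reproduce
        // the cycle in the stack traces above: the thread parks inside
        // ArrayBlockingQueue.put while owning the MiniAsyncLog monitor.
    }
}
```

The general rule the sketch illustrates: never perform an unbounded blocking wait while holding a lock that the party responsible for unblocking you must also acquire.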