[ https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088124#comment-16088124 ]
Kihwal Lee commented on HDFS-12136:
-----------------------------------

[~jojochuang], we started seeing a significant performance regression after I/O activity increased. Jstack output revealed that the DataXceiver threads are all waiting for the dataset impl lock. When the I/O load is moderate, this may not be visible.

{noformat}
"org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@61a9d939" #351184 daemon prio=5 os_prio=0 tid=0x00007f94ddf0a000 nid=0xafef waiting on condition [0x00007f94c1d4f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d55efd28> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.apache.hadoop.hdfs.InstrumentedLock.lock(InstrumentedLock.java:102)
        at org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.acquireDatasetLock(FsDatasetImpl.java:3274)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:252)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2348)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None
{noformat}

{noformat}
"DataXceiver for client DFSClient_xxx [Sending block xxx]" #351183 daemon prio=5 os_prio=0 tid=0x000000000409b000 nid=0xafee waiting on condition [0x00007f94c9f49000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d55efd28> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.apache.hadoop.hdfs.InstrumentedLock.lock(InstrumentedLock.java:102)
        at org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.acquireDatasetLock(FsDatasetImpl.java:3274)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:252)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:580)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None
{noformat}

> BlockSender performance regression due to volume scanner edge case
> ------------------------------------------------------------------
>
>                 Key: HDFS-12136
>                 URL: https://issues.apache.org/jira/browse/HDFS-12136
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
> HDFS-11160 attempted to fix a volume scan race for a file appended
> mid-scan by reading the last checksum of finalized blocks within the
> {{BlockSender}} ctor. Unfortunately, it holds the exclusive dataset lock
> while opening and reading the metafile multiple times, so block sender
> instantiation becomes serialized. Performance completely collapses under
> heavy disk I/O utilization or high xceiver activity, e.g. lost-node
> replication, balancing, or decommissioning. The xceiver threads congest
> while creating block senders and impair the heartbeat processing that
> contends for the same lock. Combined with other lock contention issues,
> pipelines break and nodes sporadically go dead.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
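The serialization described above can be sketched in miniature: if the metafile read happens inside the coarse dataset lock, every BlockSender-style caller queues behind one thread's disk I/O; resolving state under the lock and reading outside it keeps the critical section short. This is a hypothetical illustration of the locking pattern, not the actual HDFS-12136 patch; the class and method names (DatasetLockSketch, lastChecksumSlow/lastChecksumFast) are made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the contention pattern; names are illustrative,
// not the real FsDatasetImpl/BlockSender API.
public class DatasetLockSketch {
    private final ReentrantLock datasetLock = new ReentrantLock();

    // Anti-pattern: the metafile is read while the exclusive dataset lock
    // is held, so all callers serialize behind one thread's disk I/O.
    byte[] lastChecksumSlow(Path metaFile) throws IOException {
        datasetLock.lock();
        try {
            return Files.readAllBytes(metaFile); // slow I/O under the lock
        } finally {
            datasetLock.unlock();
        }
    }

    // Narrowed critical section: only in-memory state is consulted under
    // the lock; the metafile read happens after the lock is released.
    byte[] lastChecksumFast(Path metaFile) throws IOException {
        Path resolved;
        datasetLock.lock();
        try {
            resolved = metaFile; // stand-in for looking up replica state
        } finally {
            datasetLock.unlock();
        }
        return Files.readAllBytes(resolved); // I/O outside the lock
    }

    public static void main(String[] args) throws IOException {
        Path meta = Files.createTempFile("blk_", ".meta");
        Files.write(meta, "checksum-bytes".getBytes(StandardCharsets.UTF_8));
        DatasetLockSketch ds = new DatasetLockSketch();
        System.out.println(new String(ds.lastChecksumSlow(meta), StandardCharsets.UTF_8));
        System.out.println(new String(ds.lastChecksumFast(meta), StandardCharsets.UTF_8));
    }
}
```

Both variants return the same bytes; the difference is only how long the lock is held, which is exactly what the jstack traces above surface when many xceivers pile up.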