[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101756#comment-16101756 ] Zhe Zhang commented on HADOOP-14214:
Thanks [~liuml07] for the fix. Indeed a major bug. I just backported to branch-2.7.

DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
-----------------------------------------------------------------------------------

Key: HADOOP-14214
URL: https://issues.apache.org/jira/browse/HADOOP-14214
Project: Hadoop Common
Issue Type: Bug
Components: hdfs-client
Reporter: Mingliang Liu
Assignee: Mingliang Liu
Priority: Critical
Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
Attachments: HADOOP-14214.000.patch

Our Hive team found a TPCDS job whose queries running on LLAP seemed to be getting stuck. Dozens of threads were waiting for the {{DfsClientShmManager::lock}}, as in the following jstack:
{code}
Thread 251 (IO-Elevator-Thread-5):
  State: WAITING
  Blocked count: 3871
  Waited count: 4565
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
    org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
    org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
    org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
    org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
    org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
    org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
    org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
    org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
    org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
    org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
    org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64)
    org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622)
{code}
The thread that is expected to signal those threads is calling the {{DomainSocketWatcher::add()}} method, but it is stuck there dealing with InterruptedException indefinitely. The jstack is like:
{code}
Thread 44417 (TezTR-257387_2840_12_10_52_0):
  State: RUNNABLE
  Blocked count: 3
  Waited count: 5
  Stack:
    java.lang.Throwable.fillInStackTrace(Native Method)
    java.lang.Throwable.fillInStackTrace(Throwable.java:783)
    java.lang.Throwable.<init>(Throwable.java:250)
    java.lang.Exception.<init>(Exception.java:54)
    java.lang.InterruptedException.<init>(InterruptedException.java:57)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034)
    org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
    org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
    org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
    org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
    org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
    org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
    org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
{code}
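For context on the mechanics: the pre-fix wait loop reduces to roughly the sketch below (my own simplified reconstruction, not the exact Hadoop source). Per its contract, {{Condition.await()}} throws {{InterruptedException}} immediately when the calling thread's interrupt status is already set, and in the AQS implementation that early throw happens before the lock is released. So once the loop re-interrupts itself, every iteration just constructs a new {{InterruptedException}} while still holding the lock: the thread spins in RUNNABLE state (the second stack above), and the signalling thread can never get in.
{code}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SelfInterruptSpin {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition processedCond = lock.newCondition();
  private volatile boolean processed = false; // flipped by the watcher thread

  // Simplified reconstruction of the pre-fix wait loop in
  // DomainSocketWatcher#add()/delete(). After an interrupt, the catch block
  // restores the interrupt status, so the next await() throws again before
  // parking (and before releasing the lock). The loop spins forever
  // allocating InterruptedExceptions, and the watcher thread, which must
  // acquire this lock to signal processedCond, is locked out.
  public void waitUntilProcessed() {
    lock.lock();
    try {
      while (!processed) {
        try {
          processedCond.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // self-interrupt re-arms the throw
        }
      }
    } finally {
      lock.unlock();
    }
  }
}
{code}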
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939220#comment-15939220 ] Hudson commented on HADOOP-14214:
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11450 (see [https://builds.apache.org/job/Hadoop-trunk-Commit/11450/])
HADOOP-14214. DomainSocketWatcher::add()/delete() should not self (liuml07: rev d35e79abc2fee7153a6168e6088f100de59d8c81)
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix/DomainSocketWatcher.java
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939179#comment-15939179 ] Tsz Wo Nicholas Sze commented on HADOOP-14214:
The try-await-catch-InterruptedException-interrupt pattern is clearly a bug. Using awaitUninterruptibly sounds good. +1 on the patch.
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937739#comment-15937739 ] Jitendra Nath Pandey commented on HADOOP-14214:
The wait is there to ensure the domain-socket fd has been added to the DomainSocketWatcher thread's collection. It is a very short wait, but important, so that fds are managed appropriately. The interrupt is not lost: {{awaitUninterruptibly}} ensures the interrupt status is set when it returns.
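That property is easy to check in isolation. The demo below is my own illustration (not from the patch): a thread interrupted while parked in {{awaitUninterruptibly()}} keeps waiting until it is signalled, and returns with its interrupt status still set.
{code}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class AwaitUninterruptiblyDemo {
  public static void main(String[] args) throws Exception {
    ReentrantLock lock = new ReentrantLock();
    Condition cond = lock.newCondition();

    Thread waiter = new Thread(() -> {
      lock.lock();
      try {
        cond.awaitUninterruptibly(); // interrupts do not wake this wait
      } finally {
        lock.unlock();
      }
      // The interrupt delivered while waiting is not lost:
      System.out.println("interrupted? " + Thread.currentThread().isInterrupted());
    });
    waiter.start();

    Thread.sleep(200);
    waiter.interrupt(); // the waiter keeps waiting
    Thread.sleep(200);

    lock.lock();
    try {
      cond.signal();    // only the signal releases the waiter
    } finally {
      lock.unlock();
    }
    waiter.join();      // prints: interrupted? true
  }
}
{code}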
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937587#comment-15937587 ] Hadoop QA commented on HADOOP-14214:
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 25s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 12m 39s | trunk passed |
| +1 | compile | 20m 17s | trunk passed |
| +1 | checkstyle | 0m 35s | trunk passed |
| +1 | mvnsite | 1m 0s | trunk passed |
| +1 | mvneclipse | 0m 19s | trunk passed |
| +1 | findbugs | 1m 25s | trunk passed |
| +1 | javadoc | 0m 50s | trunk passed |
| +1 | mvninstall | 0m 37s | the patch passed |
| +1 | compile | 15m 58s | the patch passed |
| +1 | javac | 15m 58s | the patch passed |
| +1 | checkstyle | 0m 36s | the patch passed |
| +1 | mvnsite | 1m 0s | the patch passed |
| +1 | mvneclipse | 0m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 38s | the patch passed |
| +1 | javadoc | 0m 49s | the patch passed |
| -1 | unit | 7m 44s | hadoop-common in the patch failed. |
| +1 | asflicense | 0m 33s | The patch does not generate ASF License warnings. |
| | | 68m 36s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.security.TestRaceWhenRelogin |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | HADOOP-14214 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12860055/HADOOP-14214.000.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux a20fbdc56866 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f462e1f |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/testReport/ |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937522#comment-15937522 ] Sergey Shelukhin commented on HADOOP-14214:
Hmm... the reason we are interrupting the thread in question is that we want it to be interrupted (the work it is performing is no longer relevant). Wouldn't this just cause it to be stuck forever anyway, or at best to continue a useless operation?
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937039#comment-15937039 ] Mingliang Liu commented on HADOOP-14214:
Ping [~cmccabe], [~jnp], [~arpitagarwal] for discussion.
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937034#comment-15937034 ] Mingliang Liu commented on HADOOP-14214:
One possible fix is to call {{awaitUninterruptibly()}} instead of {{await()}}.
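Concretely, the proposed fix amounts to something like the following sketch (a simplified rendering of the idea, not the committed patch): the short handshake wait survives interrupts, and a pending interrupt is reported through the thread's interrupt status after the wait instead of being retried forever.
{code}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class UninterruptibleWait {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition processedCond = lock.newCondition();
  private volatile boolean processed = false; // set by the watcher thread

  // Sketch of the fixed loop: awaitUninterruptibly() releases the lock and
  // parks even if the interrupt status is already set, keeps waiting across
  // interrupts, and returns with the interrupt status set if one arrived,
  // so callers can still observe the interrupt after the handshake.
  public void waitUntilProcessed() {
    lock.lock();
    try {
      while (!processed) {
        processedCond.awaitUninterruptibly();
      }
    } finally {
      lock.unlock();
    }
  }
}
{code}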