manoj created HIVE-23792:
----------------------------

             Summary: [LLAP] Long continuous running job degrade performance of LLAP because of leaked shuffle manager threads
                 Key: HIVE-23792
                 URL: https://issues.apache.org/jira/browse/HIVE-23792
             Project: Hive
          Issue Type: Bug
          Components: llap, Query Processor, Tez
    Affects Versions: 3.1.0
         Environment: Ubuntu 18.04
Hadoop 3.1.1
TEZ: 0.9.1
Hive: 3.1.0
JDK: 1.8
            Reporter: manoj
         Attachments: Screenshot from 2020-07-01 17-43-57.png, t3.dump, tdump.pdf

*[Test Case/Reproduction]*
Run TPC-H Q19 on 10 GB of data in an infinite loop, with result caching disabled.

*[Observation]*
On the LLAP server, the number of threads increases continuously. The query keeps running, but performance degrades over time.

*[Analysis]*
I took multiple thread dumps at different intervals to find out which category of threads is causing this issue. The culprit is the *tez-shuffle manager* thread, which waits in this loop:

.m2/org/apache/tez/tez-runtime-library/0.9.1/tez-runtime-library-0.9.1-sources.jar!/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java:324
{quote}try {
  while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
      && numCompletedInputs.get() < numInputs) {
    inputContext.notifyProgress();
    boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
  }
} finally {
  lock.unlock();
}{quote}

*[Stack Trace of culprit thread]*
{quote}threadId:Thread 16661 - state:BLOCKED
stackTrace:
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=125, line=327 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=1, line=311 (Compiled frame)
 - org.apache.tez.common.CallableWithNdc.call() @bci=8, line=36 (Compiled frame)
 - com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly() @bci=18, line=108 (Compiled frame)
 - com.google.common.util.concurrent.InterruptibleTask.run() @bci=16, line=41 (Compiled frame)
 - com.google.common.util.concurrent.TrustedListenableFutureTask.run() @bci=10, line=77 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=748 (Compiled frame){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
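
The comparison of thread dumps described in the analysis (finding which category of threads keeps growing) could be approximated with a small helper like the one below. This is only a sketch: the class `ThreadNameHistogram` and its methods are hypothetical and are not part of Hive, Tez, or LLAP. It assumes that leaked threads share a name pattern that differs only in embedded numbers, so every run of digits is collapsed to `N` before counting.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic helper (not part of Hive or Tez): counts threads per name pattern. */
public class ThreadNameHistogram {

    /** Collapses every run of digits to "N" so that numbered threads share one key. */
    static String normalize(String threadName) {
        return threadName.replaceAll("\\d+", "N");
    }

    /** Groups the given thread names by normalized pattern and counts each group. */
    static Map<String, Integer> countByPattern(Collection<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(normalize(name), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts the live threads of the JVM this code runs in (not a remote daemon). */
    static Map<String, Integer> snapshot() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        return countByPattern(names);
    }

    public static void main(String[] args) {
        // Print one line per pattern: count, then the normalized thread name.
        snapshot().forEach((pattern, count) -> System.out.println(count + "\t" + pattern));
    }
}
```

Because `snapshot()` only sees the JVM it runs in, analysing an LLAP daemon this way would mean extracting the thread names from each dump and passing them to `countByPattern`. A pattern whose count keeps rising from one dump to the next is the likely leak candidate.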