manoj created HIVE-23792:
----------------------------

             Summary: [LLAP] Long continuous running job degrade performance of 
LLAP because of leaked shuffle manager threads
                 Key: HIVE-23792
                 URL: https://issues.apache.org/jira/browse/HIVE-23792
             Project: Hive
          Issue Type: Bug
          Components: llap, Query Processor, Tez
    Affects Versions: 3.1.0
         Environment: Ubuntu 18.04
                      Hadoop: 3.1.1
                      Tez: 0.9.1
                      Hive: 3.1.0
                      JDK: 1.8
            Reporter: manoj
         Attachments: Screenshot from 2020-07-01 17-43-57.png, t3.dump, 
tdump.pdf

*[Test Case/Reproduction]*

Run TPC-H Q19 on 10 GB of data in an infinite loop, with result caching disabled.

*[Observation]*

On the LLAP server, the number of threads increases continuously. The query 
keeps running, but performance degrades over time.

*[Analysis]*

I took multiple thread dumps at different intervals to find out which 
category of threads is causing this issue. The culprit is the *tez-shuffle 
manager* thread. The relevant code is:

.m2/org/apache/tez/tez-runtime-library/0.9.1/tez-runtime-library-0.9.1-sources.jar!/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java:324

{quote}try {
  while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
      && numCompletedInputs.get() < numInputs) {
    inputContext.notifyProgress();
    boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
  }
} finally {
  lock.unlock();
}{quote}
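
The dump comparison described above can be approximated with a small helper that groups thread names into categories and counts them. This is only a minimal sketch, not the procedure used for this report: the class {{ThreadCategoryCount}}, its methods, and the rule of stripping trailing digits and separators from a thread name are all assumptions made for illustration. It counts the threads of the JVM it runs in; for an external process such as an LLAP daemon, the names would have to be extracted from the thread dumps first and passed to {{countByCategory}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ThreadCategoryCount {

    // Hypothetical naming rule: drop trailing digits and separators so that
    // "worker-1" and "worker-2" fall into the same category "worker".
    static String category(String name) {
        String c = name.replaceAll("[\\s\\-#\\[\\]{}()_:0-9]+$", "");
        return c.isEmpty() ? name : c;
    }

    // Count how many thread names fall into each category.
    static Map<String, Integer> countByCategory(Iterable<String> names) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String n : names) {
            counts.merge(category(n), 1, Integer::sum);
        }
        return counts;
    }

    // Snapshot of the live threads in the current JVM, grouped by category.
    static Map<String, Integer> snapshot() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        return countByCategory(names);
    }

    public static void main(String[] args) {
        // Print one line per category: count, then category name.
        snapshot().forEach((k, v) -> System.out.println(v + "\t" + k));
    }
}
```

Taking such a snapshot at intervals and comparing the counts shows which category keeps growing while the others stay flat.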

 

*[Stack Trace of culprit thread]*
{quote}threadId:Thread 16661 - state:BLOCKED
 stackTrace:
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=125, line=327 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=1, line=311 (Compiled frame)
 - org.apache.tez.common.CallableWithNdc.call() @bci=8, line=36 (Compiled frame)
 - com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly() @bci=18, line=108 (Compiled frame)
 - com.google.common.util.concurrent.InterruptibleTask.run() @bci=16, line=41 (Compiled frame)
 - com.google.common.util.concurrent.TrustedListenableFutureTask.run() @bci=10, line=77 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=748 (Compiled frame){quote}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
