This is an automated email from the ASF dual-hosted git repository.

zuston pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git


The following commit(s) were added to refs/heads/master by this push:
     new 1431c8a1b [#1631] fix(server): ShuffleTaskInfo may leak when app is 
removed. (#1632)
1431c8a1b is described below

commit 1431c8a1b8b53ffd2b9ab45372416e9842eba944
Author: zhengchenyu <[email protected]>
AuthorDate: Wed Apr 10 11:14:44 2024 +0800

    [#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
    
    ### What changes were proposed in this pull request?
    
    In our cluster, delete pod is denied by web hook, even though all 
application is deleted for long time.
    When I curl http://host:ip/metrics/server, I found app_num_with_node is 1.
    The problem is some application is leaked. I see many duplicated logs 
`[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired 
appId[appattempt_xxx_xx_xx] according to 
rss.server.app.expired.withoutHeartbeat`.
    When I jstack the server many times, clearResourceThread will be stuck 
forever, here is the call stack.
    ```
    "clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms 
elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition  
[0x00007f24f7b33000]
       java.lang.Thread.State: WAITING (parking)
            at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
            - parking to wait for  <0x00007f28d5e29f20> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
            at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
            at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
            at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
            at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
            at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
            at 
org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
            at 
org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
            at 
org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown
 Source)
            at java.lang.Thread.run([email protected]/Thread.java:829)
    ```
    
    Apparently there's a lock that's not being released. Looking at the code, 
it's easy to see that the read lock in the flushBuffer is not released 
correctly.  The log ` ShuffleBufferManager.flushBuffer - Shuffle[3066071] for 
app[appattempt_xxx] has already been removed, no need to flush the buffer` 
proved it.
    
    Fix: #1631
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    no test, obvious mistake
---
 .../apache/uniffle/server/buffer/ShuffleBufferManager.java | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git 
a/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
 
b/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
index 8f41a0795..35ef3fa09 100644
--- 
a/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
+++ 
b/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
@@ -298,14 +298,14 @@ public class ShuffleBufferManager {
       boolean isHugePartition) {
     ReentrantReadWriteLock.ReadLock readLock = 
shuffleTaskManager.getAppReadLock(appId);
     readLock.lock();
-    if (!bufferPool.getOrDefault(appId, new 
HashMap<>()).containsKey(shuffleId)) {
-      LOG.info(
-          "Shuffle[{}] for app[{}] has already been removed, no need to flush 
the buffer",
-          shuffleId,
-          appId);
-      return;
-    }
     try {
+      if (!bufferPool.getOrDefault(appId, new 
HashMap<>()).containsKey(shuffleId)) {
+        LOG.info(
+            "Shuffle[{}] for app[{}] has already been removed, no need to 
flush the buffer",
+            shuffleId,
+            appId);
+        return;
+      }
       ShuffleDataFlushEvent event =
           buffer.toFlushEvent(
               appId,

Reply via email to