This is an automated email from the ASF dual-hosted git repository.
zuston pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git
The following commit(s) were added to refs/heads/master by this push:
new 1431c8a1b [#1631] fix(server): ShuffleTaskInfo may leak when app is
removed. (#1632)
1431c8a1b is described below
commit 1431c8a1b8b53ffd2b9ab45372416e9842eba944
Author: zhengchenyu <[email protected]>
AuthorDate: Wed Apr 10 11:14:44 2024 +0800
[#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
### What changes were proposed in this pull request?
In our cluster, delete pod is denied by web hook, even though all
application is deleted for long time.
When I curl http://host:ip/metrics/server, I found app_num_with_node is 1.
The problem is some application is leaked. I see many duplicated logs
`[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired
appId[appattempt_xxx_xx_xx] according to
rss.server.app.expired.withoutHeartbeat`.
When I jstack the server many times, clearResourceThread will be stuck
forever, here is the call stack.
```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms
elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition
[0x00007f24f7b33000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00007f28d5e29f20> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
at
org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
at
org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
at
org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown
Source)
at java.lang.Thread.run([email protected]/Thread.java:829)
```
Apparently there's a lock that's not being released. Looking at the code,
it's easy to see that the read lock in the flushBuffer is not released
correctly. The log ` ShuffleBufferManager.flushBuffer - Shuffle[3066071] for
app[appattempt_xxx] has already been removed, no need to flush the buffer`
proved it.
Fix: #1631
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
no test, obvious mistake
---
.../apache/uniffle/server/buffer/ShuffleBufferManager.java | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git
a/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
b/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
index 8f41a0795..35ef3fa09 100644
---
a/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
+++
b/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
@@ -298,14 +298,14 @@ public class ShuffleBufferManager {
boolean isHugePartition) {
ReentrantReadWriteLock.ReadLock readLock =
shuffleTaskManager.getAppReadLock(appId);
readLock.lock();
- if (!bufferPool.getOrDefault(appId, new
HashMap<>()).containsKey(shuffleId)) {
- LOG.info(
- "Shuffle[{}] for app[{}] has already been removed, no need to flush
the buffer",
- shuffleId,
- appId);
- return;
- }
try {
+ if (!bufferPool.getOrDefault(appId, new
HashMap<>()).containsKey(shuffleId)) {
+ LOG.info(
+ "Shuffle[{}] for app[{}] has already been removed, no need to
flush the buffer",
+ shuffleId,
+ appId);
+ return;
+ }
ShuffleDataFlushEvent event =
buffer.toFlushEvent(
appId,