This is an automated email from the ASF dual-hosted git repository.
csy pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 6b3df4722 [CELEBORN-1643] DataPusher handle InterruptedException
6b3df4722 is described below
commit 6b3df472277ab714039f57cc744a5084b966a0b7
Author: sychen <[email protected]>
AuthorDate: Tue Oct 15 20:54:14 2024 +0800
[CELEBORN-1643] DataPusher handle InterruptedException
### What changes were proposed in this pull request?
`waitIdleQueueFullWithLock` now also stops waiting when `pushThread` is `null` or no longer alive.
### Why are the changes needed?
Killing a task interrupts `pushThread`. The interrupted `pushThread` may exit
without calling `reclaimTask`, so `idleQueue` still has free capacity and never
fills up. The task then stays blocked in `waitIdleQueueFullWithLock` and does
not exit.
This problem was introduced by CELEBORN-1544.
```text
24/10/10 15:43:43,103 [Executor task launch worker for task 356.1 in stage 4447.0 (TID 1126065)] ERROR DataPusher: DataPusher thread interrupted while adding push task.
24/10/10 15:43:43,103 [DataPusher-1126065] INFO DataPushQueue: Thread interrupted while waiting push task.
24/10/10 15:43:43,103 [DataPusher-1126065] ERROR DataPusher: DataPusher push thread interrupted while pushing data.
24/10/10 15:43:53,099 [Task reaper-6] WARN Executor: Killed task 1126065 is still running after 10000 ms
24/10/10 15:43:53,157 [Task reaper-6] WARN Executor: Thread dump from task 1126065:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
org.apache.celeborn.client.write.DataPusher.waitIdleQueueFullWithLock(DataPusher.java:215)
org.apache.celeborn.client.write.DataPusher.waitOnTermination(DataPusher.java:167)
org.apache.spark.shuffle.celeborn.SortBasedPusher.close(SortBasedPusher.java:453)
org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.cleanupPusher(SortBasedShuffleWriter.java:379)
org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:240)
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
```
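The hang and the guard against it can be reproduced in isolation. The following is a minimal sketch, not the Celeborn code: the class `IdleQueueWaitSketch`, the method `runScenario`, the 20 ms wait interval, and the queue capacity of 1 are all assumptions made for illustration, and the `exceptionRef` check from the real method is left out. It shows a waiter polling a condition with a timed `await` and leaving the loop once the worker thread has died without refilling the idle queue.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the waiting logic; not the Celeborn DataPusher class. */
public class IdleQueueWaitSketch {
  // Assumed polling interval for the timed await.
  private static final long WAIT_TIME_NANOS = TimeUnit.MILLISECONDS.toNanos(20);

  private final ReentrantLock lock = new ReentrantLock();
  private final Condition idleFull = lock.newCondition();
  // Starts empty, so remainingCapacity() > 0 until a worker returns a task.
  private final LinkedBlockingQueue<Object> idleQueue = new LinkedBlockingQueue<>(1);
  private volatile Thread pushThread;

  /** Returns true if the idle queue filled up, false if the wait ended for another reason. */
  boolean waitIdleQueueFull() throws InterruptedException {
    lock.lock();
    try {
      // Without the liveness check this loop would spin forever once the worker is gone.
      while (idleQueue.remainingCapacity() > 0
          && pushThread != null
          && pushThread.isAlive()) {
        idleFull.await(WAIT_TIME_NANOS, TimeUnit.NANOSECONDS);
      }
      return idleQueue.remainingCapacity() == 0;
    } finally {
      lock.unlock();
    }
  }

  /** Simulates a killed task: the worker is interrupted and exits without refilling the queue. */
  static boolean runScenario() {
    IdleQueueWaitSketch sketch = new IdleQueueWaitSketch();
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        // Exit without returning anything to idleQueue, like a push thread
        // that never reaches its reclaim step.
      }
    }, "push-thread-sketch");
    sketch.pushThread = worker;
    worker.start();
    worker.interrupt();
    try {
      return sketch.waitIdleQueueFull();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("waiter was interrupted", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("idle queue full: " + runScenario());
  }
}
```

In this sketch the waiter returns `false` shortly after the worker exits, instead of blocking indefinitely. Polling `isAlive()` on each timed wake-up is the simplest option here because a dead worker can no longer signal `idleFull` itself.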
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GitHub Actions (GA).
Closes #2805 from cxzl25/CELEBORN-1643.
Authored-by: sychen <[email protected]>
Signed-off-by: Shaoyun Chen <[email protected]>
---
client/src/main/java/org/apache/celeborn/client/write/DataPusher.java | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/client/src/main/java/org/apache/celeborn/client/write/DataPusher.java b/client/src/main/java/org/apache/celeborn/client/write/DataPusher.java
index 5acfbe938..807952d15 100644
--- a/client/src/main/java/org/apache/celeborn/client/write/DataPusher.java
+++ b/client/src/main/java/org/apache/celeborn/client/write/DataPusher.java
@@ -213,7 +213,9 @@ public class DataPusher {
   private void waitIdleQueueFullWithLock() throws InterruptedException {
     try {
-      while (idleQueue.remainingCapacity() > 0 && exceptionRef.get() == null) {
+      while (idleQueue.remainingCapacity() > 0
+          && exceptionRef.get() == null
+          && (pushThread != null && pushThread.isAlive())) {
         idleFull.await(WAIT_TIME_NANOS, TimeUnit.NANOSECONDS);
       }
     } catch (InterruptedException e) {