This is an automated email from the ASF dual-hosted git repository.
rexxiong pushed a commit to branch branch-0.5
in repository https://gitbox.apache.org/repos/asf/celeborn.git
The following commit(s) were added to refs/heads/branch-0.5 by this push:
new 55b1752f2 [CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker
55b1752f2 is described below
commit 55b1752f29d87c4f5b4d5ebb9c6383206a1bf274
Author: sychen <[email protected]>
AuthorDate: Fri Oct 11 14:13:49 2024 +0800
[CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker
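### What changes were proposed in this pull request?
In `PushDataTimeoutTest`, filter out entries whose status is `StatusCode.WORKER_UNKNOWN` from `excludedWorkers` before asserting that every excluded worker carries `PUSH_DATA_TIMEOUT_PRIMARY` or `PUSH_DATA_TIMEOUT_REPLICA`.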
### Why are the changes needed?
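Because the worker port is already in use, the status that the driver tracks for that worker may change from shutdown to unknown. The resulting `WORKER_UNKNOWN` entry in `excludedWorkers` then fails the assertion that every excluded worker has a push-data-timeout status, so the test fails.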
https://github.com/apache/celeborn/actions/runs/10465286274/job/28980278764
```
- celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***
  WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_PRIMARY, and WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_REPLICA (PushDataTimeoutTest.scala:150)
```
unit-tests.log
```
24/08/20 05:28:30,400 INFO [celeborn-dispatcher-7] Master: Receive ReportNodeFailure [
Host: localhost
RpcPort: 41487
PushPort: 34259
FetchPort: 45713
ReplicatePort: 35107
InternalPort: 41487
24/08/20 05:29:29,414 WARN [celeborn-client-lifecycle-manager-change-partition-executor-3] WorkerStatusTracker:
Reporting failed workers:
Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587 PUSH_DATA_TIMEOUT_PRIMARY 2024-08-19T22:29:29.414-0700
Current unknown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487 2024-08-19T22:29:29.108-0700
Current shutdown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA (GitHub Actions)
Closes #2697 from cxzl25/CELEBORN-1571.
Authored-by: sychen <[email protected]>
Signed-off-by: Shuang <[email protected]>
(cherry picked from commit 362865f2ce313dbec4798ed752bb2ddf825f5bbf)
Signed-off-by: Shuang <[email protected]>
---
.../org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
index 5bf7c1303..ea398968c 100644
--- a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
+++ b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
@@ -144,9 +144,12 @@ class PushDataTimeoutTest extends AnyFunSuite
.getLifecycleManager
.workerStatusTracker
.excludedWorkers
+ .asScala.filter { case (_, (code, _)) =>
+ code != StatusCode.WORKER_UNKNOWN
+ }.toMap
- assert(excludedWorkers.size() > 0)
- excludedWorkers.asScala.foreach { case (_, (code, _)) =>
+ assert(excludedWorkers.size > 0)
+ excludedWorkers.foreach { case (_, (code, _)) =>
assert(code == StatusCode.PUSH_DATA_TIMEOUT_PRIMARY ||
code == StatusCode.PUSH_DATA_TIMEOUT_REPLICA)
}
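The assertion change in the hunk can be modelled in isolation. The snippet below is a minimal sketch, not Celeborn code. `FakeStatusCode`, `ExcludedWorkersCheck`, and the `String` worker keys are hypothetical stand-ins. It assumes that `excludedWorkers` behaves like a `ConcurrentHashMap` from a worker to a `(status code, timestamp)` pair, as the `case (_, (code, _))` pattern in the test suggests. The sample entries reuse the RPC ports from the log above: 42267 timed out as primary, and 41487 became unknown.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical stand-in for Celeborn's StatusCode; only the codes used by the test.
object FakeStatusCode extends Enumeration {
  val WORKER_UNKNOWN, PUSH_DATA_TIMEOUT_PRIMARY, PUSH_DATA_TIMEOUT_REPLICA = Value
}

object ExcludedWorkersCheck {
  // Assumed shape: worker id -> (status code, timestamp).
  type Excluded = ConcurrentHashMap[String, (FakeStatusCode.Value, Long)]

  private def isPushDataTimeout(code: FakeStatusCode.Value): Boolean =
    code == FakeStatusCode.PUSH_DATA_TIMEOUT_PRIMARY ||
      code == FakeStatusCode.PUSH_DATA_TIMEOUT_REPLICA

  // Before the fix: every excluded entry must be a push-data timeout.
  def oldCheck(excluded: Excluded): Boolean =
    excluded.size() > 0 &&
      excluded.asScala.forall { case (_, (code, _)) => isPushDataTimeout(code) }

  // After the fix: drop WORKER_UNKNOWN entries first, then apply the same checks.
  def newCheck(excluded: Excluded): Boolean = {
    val filtered = excluded.asScala.filter { case (_, (code, _)) =>
      code != FakeStatusCode.WORKER_UNKNOWN
    }.toMap
    filtered.size > 0 &&
      filtered.forall { case (_, (code, _)) => isPushDataTimeout(code) }
  }

  def main(args: Array[String]): Unit = {
    val excluded: Excluded = new ConcurrentHashMap()
    excluded.put("localhost:42267", (FakeStatusCode.PUSH_DATA_TIMEOUT_PRIMARY, 1L))
    excluded.put("localhost:41487", (FakeStatusCode.WORKER_UNKNOWN, 2L))
    println(oldCheck(excluded)) // false
    println(newCheck(excluded)) // true
  }
}
```

The filtered check still requires at least one entry that is not `WORKER_UNKNOWN`. As a result, the test continues to fail if no push-data timeout was recorded at all.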