This is an automated email from the ASF dual-hosted git repository.

rexxiong pushed a commit to branch branch-0.5
in repository https://gitbox.apache.org/repos/asf/celeborn.git


The following commit(s) were added to refs/heads/branch-0.5 by this push:
     new 55b1752f2 [CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker
55b1752f2 is described below

commit 55b1752f29d87c4f5b4d5ebb9c6383206a1bf274
Author: sychen <[email protected]>
AuthorDate: Fri Oct 11 14:13:49 2024 +0800

    [CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker
    
    ### What changes were proposed in this pull request?
    
    Filter workers whose status is WORKER_UNKNOWN out of excludedWorkers before asserting on the push data timeout status codes in PushDataTimeoutTest.
    
    ### Why are the changes needed?
    
    Because the worker port is in use, the worker's status on the driver may change from shutdown to unknown, which causes the test to fail.
    
    https://github.com/apache/celeborn/actions/runs/10465286274/job/28980278764
    ```java
    - celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***
      WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_PRIMARY, and WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_REPLICA (PushDataTimeoutTest.scala:150)
    ```
    
    unit-tests.log
    ```
    24/08/20 05:28:30,400 INFO [celeborn-dispatcher-7] Master: Receive ReportNodeFailure [
    Host: localhost
    RpcPort: 41487
    PushPort: 34259
    FetchPort: 45713
    ReplicatePort: 35107
    InternalPort: 41487
    
    24/08/20 05:29:29,414 WARN [celeborn-client-lifecycle-manager-change-partition-executor-3] WorkerStatusTracker:
    Reporting failed workers:
    Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587   PUSH_DATA_TIMEOUT_PRIMARY   2024-08-19T22:29:29.414-0700
    Current unknown workers:
    Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487   2024-08-19T22:29:29.108-0700
    Current shutdown workers:
    Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    GA
    
    Closes #2697 from cxzl25/CELEBORN-1571.
    
    Authored-by: sychen <[email protected]>
    Signed-off-by: Shuang <[email protected]>
    (cherry picked from commit 362865f2ce313dbec4798ed752bb2ddf825f5bbf)
    Signed-off-by: Shuang <[email protected]>
---
 .../org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala      | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
index 5bf7c1303..ea398968c 100644
--- a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
+++ b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
@@ -144,9 +144,12 @@ class PushDataTimeoutTest extends AnyFunSuite
       .getLifecycleManager
       .workerStatusTracker
       .excludedWorkers
+      .asScala.filter { case (_, (code, _)) =>
+        code != StatusCode.WORKER_UNKNOWN
+      }.toMap
 
-    assert(excludedWorkers.size() > 0)
-    excludedWorkers.asScala.foreach { case (_, (code, _)) =>
+    assert(excludedWorkers.size > 0)
+    excludedWorkers.foreach { case (_, (code, _)) =>
       assert(code == StatusCode.PUSH_DATA_TIMEOUT_PRIMARY ||
         code == StatusCode.PUSH_DATA_TIMEOUT_REPLICA)
     }
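
The filtering step in the hunk above can be tried in isolation. The following is a minimal sketch, not the Celeborn test itself: the `Status` type and its members are hypothetical stand-ins for Celeborn's `StatusCode`, and the map keys are plain strings rather than worker info objects. It assumes only that the excluded-workers collection is a Java map whose values are `(code, timestamp)` pairs, as the pattern match in the diff suggests.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object ExcludedWorkersFilterSketch {
  // Hypothetical stand-ins for the StatusCode values named in the diff.
  sealed trait Status
  case object WorkerUnknown extends Status
  case object PushDataTimeoutPrimary extends Status
  case object PushDataTimeoutReplica extends Status

  // Drop entries whose status is "unknown" and return an immutable Scala Map,
  // mirroring the .asScala.filter { ... }.toMap chain added by the patch.
  def withoutUnknown(
      excluded: java.util.Map[String, (Status, Long)]): Map[String, (Status, Long)] =
    excluded.asScala.filter { case (_, (code, _)) =>
      code != WorkerUnknown
    }.toMap

  def main(args: Array[String]): Unit = {
    // Assumed sample data: one timed-out worker and one unknown worker.
    val excluded = new ConcurrentHashMap[String, (Status, Long)]()
    excluded.put("localhost:42267", (PushDataTimeoutPrimary, 1L))
    excluded.put("localhost:41487", (WorkerUnknown, 2L))

    val filtered = withoutUnknown(excluded)

    // Same shape of assertions as the test: non-empty, and only timeout codes remain.
    assert(filtered.size > 0)
    filtered.foreach { case (_, (code, _)) =>
      assert(code == PushDataTimeoutPrimary || code == PushDataTimeoutReplica)
    }
  }
}
```

Because the filtered result is a Scala `Map`, `size` is called without parentheses and `foreach` needs no further `asScala` conversion, which matches the two assertion lines changed in the patch.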
