This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new d37a8b9c97b3 [SPARK-49983][CORE][TESTS] Fix 
`BarrierTaskContextSuite.successively sync with allGather and barrier` test 
case to be robust
d37a8b9c97b3 is described below

commit d37a8b9c97b33ede4c7c42ec8c16b7c152d42b86
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Wed Oct 16 07:24:33 2024 -0700

    [SPARK-49983][CORE][TESTS] Fix `BarrierTaskContextSuite.successively sync 
with allGather and barrier` test case to be robust
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix `BarrierTaskContextSuite.successively sync with 
allGather and barrier` test case to be robust.
    
    ### Why are the changes needed?
    
    The test case asserts the duration of partitions. However, this is flaky 
because we don't know when a partition is triggered before `barrier` sync.
    
    
https://github.com/apache/spark/blob/0e75d19a736aa18fe77414991ebb7e3577a43af8/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala#L116-L118
    
    Although we added `TestUtils.waitUntilExecutorsUp` at Apache Spark 3.0.0 
like the following,
    
    - #28658
    
    let's say a partition starts slowly than `38ms` and all partitions sleep 
`1s` exactly. Then, the test case fails like the following.
    - https://github.com/apache/spark/actions/runs/11298639789/job/31428018075
    ```
    BarrierTaskContextSuite:
    ...
    - successively sync with allGather and barrier *** FAILED ***
      1038 was not less than or equal to 1000 
(BarrierTaskContextSuite.scala:118)
    ```
    
    According to the failure history here (SPARK-49983) and SPARK-31730, the 
slowness seems to be less than `200ms` when it happens. So, this PR aims to 
reduce the flakiness by capping the sleep up to 500ms while keeping the `1s` 
validation. There is no test coverage change because this test case focuses on 
the `successively sync with allGather and battier`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this is a test-only test case.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #48487 from dongjoon-hyun/SPARK-49983.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    (cherry picked from commit bcfe62b9988f9b00c23de0b71acc1c6170edee9e)
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 .../test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala
index 26cd5374fa09..f370a8e02391 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala
@@ -101,7 +101,7 @@ class BarrierTaskContextSuite extends SparkFunSuite with 
LocalSparkContext with
     val rdd2 = rdd.barrier().mapPartitions { it =>
       val context = BarrierTaskContext.get()
       // Sleep for a random time before global sync.
-      Thread.sleep(Random.nextInt(1000))
+      Thread.sleep(Random.nextInt(500))
       context.barrier()
       val time1 = System.currentTimeMillis()
       // Sleep for a random time before global sync.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to