[ 
https://issues.apache.org/jira/browse/SPARK-38019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38019:
----------------------------------
        Parent:     (was: SPARK-35781)
    Issue Type: Bug  (was: Sub-task)

> ExecutorMonitor.timedOutExecutors should be deterministic
> ---------------------------------------------------------
>
>                 Key: SPARK-38019
>                 URL: https://issues.apache.org/jira/browse/SPARK-38019
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2
>
>
> Since the AS-IS timedOutExecutors returns the result indeterministic, it 
> kills the executors in a random order at Dynamic Allocation setting.
> spark/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
> {code}
>  private val executors = new ConcurrentHashMap[String, Tracker]() 
> ...
>  timedOutExecs = executors.asScala 
> {code}
> This random behavior not only makes the users confusing but also causes a K8s 
> decommission tests flaky like the following case in Java 17 on Apple Silicon 
> environment. The K8s test expects the decommission of executor 1 while the 
> executor 2 is chosen at this time.
> {code}
> 22/01/25 06:11:16 DEBUG ExecutorMonitor: Executors 1,2 do not have active 
> shuffle data after job 0 finished.
> 22/01/25 06:11:16 DEBUG ExecutorAllocationManager: max needed for rpId: 0 
> numpending: 0, tasksperexecutor: 1
> 22/01/25 06:11:16 DEBUG ExecutorAllocationManager: No change in number of 
> executors
> 22/01/25 06:11:16 DEBUG ExecutorAllocationManager: Request to remove 
> executorIds: (2,0), (1,0)
> 22/01/25 06:11:16 DEBUG ExecutorAllocationManager: Not removing idle executor 
> 1 because there are only 1 executor(s) left (minimum number of executor limit 
> 1)
> 22/01/25 06:11:16 INFO KubernetesClusterSchedulerBackend: Decommission 
> executors: 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to