[GitHub] [spark] Tagar commented on a change in pull request #32766: [SPARK-35627][CORE] Decommission executors in batches to not overload network bandwidth

GitBox Thu, 03 Jun 2021 13:42:21 -0700


Tagar commented on a change in pull request #32766:
URL: https://github.com/apache/spark/pull/32766#discussion_r645111860




##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2023,6 +2023,22 @@ package object config {
       .stringConf
       .createWithDefaultString("PWR")
 
+  private[spark] val EXECUTOR_DECOMMISSION_BATCH_INTERVAL =
+    ConfigBuilder("spark.executor.decommission.batchInterval")
+      .doc("Executors are decommissioned in batched to avoid overloading network bandwidth in" +
+        " migrating rdd and shuffle data. This config sets the interval between batches.")
+      .version("3.2.0")
+      .timeConf(TimeUnit.MILLISECONDS)
+      .createWithDefault(30000)

Review comment:
   Azure gives only 30s of notice before a spot instance is removed. With
   these defaults, only 3 nodes will get a chance to evacuate shuffle/cache
   data (9 nodes on AWS with the 90s notification).
   Any chance to make this (much) smaller?
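
   The figures above can be reproduced with a rough sketch. It assumes a
   batch size of 3 executors (inferred from the numbers in this comment, not
   confirmed by the diff shown here) and that batches start at t = 0 and then
   once per interval, counting only batches that start before the notice
   expires. `DecommissionEstimate` and its methods are hypothetical helpers,
   not part of the PR or of Spark.

```scala
// Hypothetical helper, not part of the PR: estimates how many executors
// begin decommissioning before a spot-termination notice expires.
object DecommissionEstimate {
  // Batches start at t = 0, interval, 2 * interval, ...
  // Only batches starting strictly before the notice expires are counted,
  // which is ceil(noticeMs / intervalMs).
  def batchesStarted(noticeMs: Long, intervalMs: Long): Long = {
    require(noticeMs > 0 && intervalMs > 0, "durations must be positive")
    (noticeMs + intervalMs - 1) / intervalMs // ceiling division
  }

  def executorsStarted(noticeMs: Long, intervalMs: Long, batchSize: Int): Long =
    batchesStarted(noticeMs, intervalMs) * batchSize

  def main(args: Array[String]): Unit = {
    val batchSize = 3 // assumption inferred from the figures in the comment
    println(executorsStarted(30000L, 30000L, batchSize)) // 30s notice, 30s interval: 3
    println(executorsStarted(90000L, 30000L, batchSize)) // 90s notice, 30s interval: 9
    println(executorsStarted(30000L, 5000L, batchSize))  // 30s notice, 5s interval: 18
  }
}
```

   Under the same assumptions, a 5s interval would let 18 executors start
   migrating within a 30s notice. Starting a batch does not guarantee that
   its migration finishes in time, so these counts are an upper bound.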




