dongjoon-hyun commented on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-750471287


   Thank you, @mridulm. Sure!
   
   In terms of resource management, the K8s environment is currently more 
aggressive than the other existing resource managers. For example, not only do 
heterogeneous apps (Spark, Hive, and other kinds of jobs) compete for resources 
in the cluster, but the cluster size itself is also adjusted dynamically.
   
   Your recent series of external shuffle service patches is helpful, of 
course, and there are other approaches along the same lines. For example, 
there is `Worker Decommission` and its 
`spark.storage.decommission.rddBlocks.enabled`. When it is enabled, 
`BlockManagerDecommissioner`'s `rddBlockMigrationRunnable` tries to migrate all 
RDD blocks by using `BlockManager.replicateBlock`. This is logically similar to 
`spark.storage.replication.proactive=true`.
   ```scala
     private def migrateBlock(blockToReplicate: ReplicateBlock): Boolean = {
       val replicatedSuccessfully = bm.replicateBlock(
         blockToReplicate.blockId,
         blockToReplicate.replicas.toSet,
         blockToReplicate.maxReplicas,
         maxReplicationFailures = Some(maxReplicationFailuresForDecommission))
       if (replicatedSuccessfully) {
         logInfo(s"Block ${blockToReplicate.blockId} offloaded successfully, Removing block now")
         bm.removeBlock(blockToReplicate.blockId)
         logInfo(s"Block ${blockToReplicate.blockId} removed")
       } else {
         logWarning(s"Failed to offload block ${blockToReplicate.blockId}")
       }
       replicatedSuccessfully
     }
   ```
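
   For reference, the settings discussed above could be collected as in the 
   sketch below. It is only an illustration: the `DecommissionConfSketch` 
   object and its `submitArgs` helper are hypothetical, and 
   `spark.decommission.enabled` and `spark.storage.decommission.enabled` are 
   assumed to be the Spark 3.1+ switches that block migration depends on.

   ```scala
   // Hypothetical helper object; only the config keys refer to Spark itself.
   object DecommissionConfSketch {
     // Settings discussed in this comment, as key -> value pairs.
     // The first two keys are assumptions (Spark 3.1+ names).
     val settings: Seq[(String, String)] = Seq(
       "spark.decommission.enabled" -> "true",
       "spark.storage.decommission.enabled" -> "true",
       "spark.storage.decommission.rddBlocks.enabled" -> "true",
       "spark.storage.replication.proactive" -> "true"
     )

     // Render the settings as `--conf key=value` arguments for spark-submit.
     def submitArgs: Seq[String] =
       settings.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

     def main(args: Array[String]): Unit =
       println(submitArgs.mkString(" "))
   }
   ```

   Enabling both the decommission migration and proactive replication is a 
   choice for illustration here, not a recommendation from the PR.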
   
   However, K8s sometimes doesn't wait long enough and kills executors in the 
middle of migration once its predefined grace period expires. In that case, 
`spark.storage.replication.proactive` becomes important again for recovering 
the lost blocks. `spark.storage.replication.proactive` is also much less chatty 
than `BlockManagerDecommissioner`, which tries to migrate all storage blocks, 
including shuffles.
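
   The race with the grace period can be pictured with a toy model. This is 
   not Spark code: `GracePeriodSketch`, `migrateUntilKilled`, and the fixed 
   per-block migration time are all hypothetical simplifications.

   ```scala
   // Toy model of an executor being killed while it is still migrating blocks.
   object GracePeriodSketch {
     // Blocks are migrated one at a time, each taking `secondsPerBlock`.
     // Whatever has not finished when the grace period ends is lost.
     // Returns (migrated, lost) block ids.
     def migrateUntilKilled(
         blocks: Seq[String],
         secondsPerBlock: Int,
         gracePeriodSeconds: Int): (Seq[String], Seq[String]) = {
       require(secondsPerBlock > 0, "secondsPerBlock must be positive")
       val finished = math.max(0, gracePeriodSeconds) / secondsPerBlock
       blocks.splitAt(math.min(finished, blocks.length))
     }

     def main(args: Array[String]): Unit = {
       val blocks = Seq("rdd_0_0", "rdd_0_1", "rdd_0_2", "rdd_0_3")
       val (migrated, lost) = migrateUntilKilled(blocks, 10, 30)
       println(s"migrated=$migrated lost=$lost")
     }
   }
   ```

   In this model the `lost` blocks are the ones left for another mechanism, 
   such as `spark.storage.replication.proactive`, to recover.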
   
   This is the use case I'm focused on, and I'd love to hear your concerns. 
I'm all ears. 😄




