dongjoon-hyun edited a comment on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-750471287
Thank you, @mridulm . Sure!
In terms of resource management, the K8s environment is currently more
aggressive than the other existing resource managers. For example, not only do
heterogeneous apps (Spark, Hive, and other kinds of jobs) compete for resources
in the cluster, but the cluster size itself is also dynamically adjusted.
Your recent series of external shuffle service patches is helpful, of
course. In addition to that, there are other helpful options like this one. For
example, there is `Worker Decommission` and its
`spark.storage.decommission.rddBlocks.enabled`. If it's enabled,
`BlockManagerDecommissioner`'s `rddBlockMigrationRunnable` tries to migrate
all RDD blocks by using `BlockManager.replicateBlock`. It's logically similar
to `spark.storage.replication.proactive=true`.
```scala
private def migrateBlock(blockToReplicate: ReplicateBlock): Boolean = {
  val replicatedSuccessfully = bm.replicateBlock(
    blockToReplicate.blockId,
    blockToReplicate.replicas.toSet,
    blockToReplicate.maxReplicas,
    maxReplicationFailures = Some(maxReplicationFailuresForDecommission))
  if (replicatedSuccessfully) {
    logInfo(s"Block ${blockToReplicate.blockId} offloaded successfully, Removing block now")
    bm.removeBlock(blockToReplicate.blockId)
    logInfo(s"Block ${blockToReplicate.blockId} removed")
  } else {
    logWarning(s"Failed to offload block ${blockToReplicate.blockId}")
  }
  replicatedSuccessfully
}
```
However, K8s sometimes doesn't wait long enough and kills executors in the
middle of migration once its predefined grace period expires. At that point,
`spark.storage.replication.proactive` becomes important again for recovering
the lost blocks. `spark.storage.replication.proactive` is also much less chatty
than `BlockManagerDecommissioner`, which tries to migrate all storage blocks,
including shuffle blocks.
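To make the combination concrete, here is a minimal sketch of how the two
mechanisms might be enabled together. It assumes Spark 3.1+ with `spark-core`
on the classpath (for example, pasted into `spark-shell`). The app name and
the specific set of flags are illustrative assumptions, not a recommendation
from this PR.

```scala
import org.apache.spark.SparkConf

// Sketch only: decommission-time RDD block migration, with proactive
// replication as a fallback. All values here are illustrative.
val conf = new SparkConf()
  .setAppName("decommission-sketch") // hypothetical app name
  // Ask Spark to shut executors down gracefully when they are decommissioned.
  .set("spark.decommission.enabled", "true")
  // Enable block migration during decommission ...
  .set("spark.storage.decommission.enabled", "true")
  // ... for cached RDD blocks (the option discussed above).
  .set("spark.storage.decommission.rddBlocks.enabled", "true")
  // Fallback: re-replicate cached RDD blocks whose replicas were lost,
  // e.g. when the pod is killed before migration finishes.
  .set("spark.storage.replication.proactive", "true")
```

As far as I understand, proactive replication can only restore a block when
at least one other replica survives, so this sketch presumes the RDDs are
cached with a replicated storage level such as `StorageLevel.MEMORY_AND_DISK_2`.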
This is the use case I'm focused on, and I'd love to hear your concerns. I'm
all ears. 😄