vanzin commented on a change in pull request #24817: [WIP][SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#discussion_r296927131
########## File path: core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala ########## @@ -64,6 +69,26 @@ private[spark] class ExecutorMonitor( private val nextTimeout = new AtomicLong(Long.MaxValue) private var timedOutExecs = Seq.empty[String] + // Active job tracking. + // + // The following state is used when an external shuffle service is not in use, and allows Spark + // to scale down based on whether the shuffle data stored in executors is in use. + // + // The algorithm works as following: when jobs start, some state is kept that tracks which stages + // are part of that job, and which shuffle ID is attached to those stages. As tasks finish, the + // executor tracking code is updated to include the list of shuffles for which it's storing + // shuffle data. + // + // If executors hold shuffle data that is related to an active job, then the executor is + // considered to be in "shuffle busy" state; meaning that the executor is not allowed to be + // removed. If the executor has shuffle data but it doesn't relate to any active job, then it + // may be removed when idle, following the same timeout configuration used for cache blocks. Review comment: > Getting this timeout right becomes use case dependent but also adds another headache to the configuration options people need to think of Actually the heuristic is written in a way that for usual applications, this timeout shouldn't matter much. The context cleaner runs the GC periodically, which would clean up shuffles related to orphaned RDDs. Normal application run would also cause these orphaned RDDs to be collected. So for all practical purposes, in a normal application the executor would become idle soon after jobs are run (doubly so for SQL jobs, which don't seem to reuse the underlying RDDs). For shells this is a little more tricky because of the way the shell itself holds on to RDD references. So if you're directly playing with RDDs in a shell this may not perform as well, and then you'd be relying on the timeout. But, like the above, for SQL jobs you'd still end up with idle executors more often than not. But, on a side note, this comment is inaccurate since I ended up adding a separate timeout for the shuffle. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
