[
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421503#comment-15421503
]
Mark Hamstra commented on SPARK-17064:
--------------------------------------
[~kayousterhout] [[email protected]] [~imranr]
> Reconsider spark.job.interruptOnCancel
> --------------------------------------
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.
> This has been recognized for a very long time (see, e.g., the ancient TODO
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've
> never had more than an incomplete solution. Killing running Tasks at the
> Executor level has been implemented by interrupting the threads running the
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill). Since
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
> addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting
> Task threads in this way has only been possible if interruptThread is true,
> and that typically comes from the setting of the interruptOnCancel property
> in the JobGroup, which in turn typically comes from the setting of
> spark.job.interruptOnCancel. Because of concerns over
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes
> being marked dead when a Task thread is interrupted, the default value of the
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already
> running on an Executor even when the Task has been canceled in the DAGScheduler,
> the Stage has been aborted, or the Job has been killed, etc.
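> For reference, interruptOnCancel is normally supplied per JobGroup through the
> public SparkContext API and then consulted when that group is cancelled. A
> minimal sketch (assuming sc is an existing SparkContext, as in spark-shell; the
> group id and the deliberately slow map are only illustrative):
> {code:scala}
> // Opt this thread's jobs into thread interruption when the group is cancelled.
> sc.setJobGroup("demo-group", "long-running demo job", interruptOnCancel = true)
>
> // Submit a slow job asynchronously so it can be cancelled from this thread.
> val pending = sc.parallelize(1 to 4, 4)
>   .map { i => Thread.sleep(60000); i }
>   .collectAsync()
>
> // Cancelling the group asks Executors to kill the running Tasks; whether their
> // threads are actually interrupted depends on the interruptOnCancel flag above.
> sc.cancelJobGroup("demo-group")
> {code}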
> There are several issues resulting from this current state of affairs, and
> they each probably need to spawn their own JIRA issue and PR once we decide
> on an overall strategy here. Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS
> versions that Spark now supports, so that we can set the default value of
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it
> still enough of an issue for non-HDFS usage (e.g. Cassandra) that we still
> need protection similar to what the current default value of
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> (One cooperative alternative is sketched after this list.)
> * Once we have a safe mechanism to stop and clean up after already executing
> Tasks, there is still the question of whether we _should_ end executing
> Tasks. While that is likely a good thing to do in cases where individual
> Tasks are lightweight in terms of resource usage, at least in some cases not
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436
> That means that we probably need to keep Task interruption configurable at the
> Job or JobGroup level (and we need better documentation explaining how and when
> to allow interruption or not).
> * There is one place in the current code
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to
> "true". This should be fixed, and similar misuses of killTask be denied in
> pull requests until this issue is adequately resolved.
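> Regarding what we might do instead of interrupting Task threads: the kill flag
> set by Task#kill is visible to user code via TaskContext.isInterrupted even when
> interruptThread is false, so long-running Tasks can already check for
> cancellation cooperatively. A minimal sketch (sc is assumed as above; records
> and expensiveTransform are placeholder names, not anything in the current code):
> {code:scala}
> import org.apache.spark.{TaskContext, TaskKilledException}
>
> val records = sc.parallelize(Seq("alpha", "beta", "gamma"))  // placeholder input
> def expensiveTransform(s: String): String = s.reverse        // stand-in for real work
>
> val transformed = records.mapPartitions { iter =>
>   val ctx = TaskContext.get()
>   iter.map { record =>
>     // Cooperative cancellation: Task#kill sets this flag regardless of
>     // interruptThread, so the Task can stop cleanly without a thread interrupt.
>     if (ctx.isInterrupted()) throw new TaskKilledException
>     expensiveTransform(record)
>   }
> }
> {code}
> Spark's own InterruptibleIterator already does a similar check for some internal
> iterators, so a safer default might lean on cooperative checks like this and only
> fall back to Thread.interrupt where it is known to be safe.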