Mark Hamstra created SPARK-17064:
------------------------------------
Summary: Reconsider spark.job.interruptOnCancel
Key: SPARK-17064
URL: https://issues.apache.org/jira/browse/SPARK-17064
Project: Spark
Issue Type: Improvement
Components: Scheduler, Spark Core
Reporter: Mark Hamstra
There is a frequent need or desire in Spark to cancel Tasks that are already running.
This has been recognized for a very long time (see, e.g., the ancient TODO
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we have
never had more than an incomplete solution. Killing running Tasks at the
Executor level is implemented by interrupting the threads running those Tasks
(taskThread.interrupt in o.a.s.scheduler.Task#kill). Since
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
addressed https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task
threads in this way has only been possible when interruptThread is true. That
flag typically comes from the interruptOnCancel property of the JobGroup, which
in turn typically comes from the setting of spark.job.interruptOnCancel. Because
of concerns over https://issues.apache.org/jira/browse/HDFS-1208 and the
possibility of nodes being marked dead when a Task thread is interrupted, the
default value of this boolean has been "false" -- i.e., by default we do not
interrupt Tasks already running on Executors even when the Task has been
canceled in the DAGScheduler, the Stage has been aborted, the Job has been
killed, etc.
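For reference, here is a minimal, hypothetical user-level sketch (local mode, a
made-up group id, and a deliberately slow job) of the path by which
interruptOnCancel normally gets set and used today: SparkContext#setJobGroup
records it as a JobGroup-level property, and SparkContext#cancelJobGroup triggers
the cancellation. This only illustrates the existing API under those assumptions;
it is not a proposed change.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object InterruptOnCancelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("interrupt-on-cancel-sketch").setMaster("local[2]"))

    // Submit a deliberately slow job from a separate thread so the main thread
    // is free to cancel it. setJobGroup uses thread-local properties, so it
    // must be called on the thread that submits the job.
    val submitter = new Thread {
      override def run(): Unit = {
        // interruptOnCancel = true asks Executors to interrupt the Task threads
        // when this group is cancelled; the current default is false.
        sc.setJobGroup("slow-group", "deliberately slow job", interruptOnCancel = true)
        try {
          sc.parallelize(1 to 1000, 4).foreach(_ => Thread.sleep(1000))
        } catch {
          case e: Exception => println(s"job ended early: ${e.getMessage}")
        }
      }
    }
    submitter.start()

    Thread.sleep(5000)
    // Cancels all jobs in the group. With interruptOnCancel = true, the
    // already-running Task threads are interrupted instead of being left to
    // run to completion on the Executors.
    sc.cancelJobGroup("slow-group")

    submitter.join()
    sc.stop()
  }
}
{code}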
There are several issues resulting from this state of affairs, and each of them
probably needs to spawn its own JIRA issue and PR once we decide on an overall
strategy here. Among those issues:
* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS
versions that Spark now supports, so that we can set the default value of
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still
enough of an issue for non-HDFS usage (e.g., Cassandra) that we still need
protection similar to what the current default value of
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?
(One possible alternative, cooperative cancellation via TaskContext checks, is
sketched after this list.)
* Once we have a safe mechanism to stop and clean up after already-executing
Tasks, there is still the question of whether we _should_ end executing Tasks.
While that is likely a good thing to do in cases where individual Tasks are
lightweight in terms of resource usage, in at least some cases not all running
Tasks should be ended: https://github.com/apache/spark/pull/12436. That means
we probably need to continue making Task interruption configurable at the Job
or JobGroup level (and we need better documentation explaining how and when to
allow interruption or not).
* There is one place in the current code (TaskSetManager#handleSuccessfulTask)
that hard-codes interruptThread to "true". This should be fixed, and similar
misuses of killTask should be rejected in pull requests until this issue is
adequately resolved.
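As one possible answer to the "what should we do instead?" question above, here
is a minimal, hypothetical sketch of cooperative cancellation done entirely at
the user level: instead of relying on Thread.interrupt() reaching a blocking
call, the Task code periodically checks TaskContext#isInterrupted and bails out
on its own. The job, partition counts, and expensiveTransform helper are all
made up for illustration; whether Spark itself should provide or encourage
something along these lines is exactly the kind of design question this issue
is meant to settle.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext, TaskContext, TaskKilledException}

object CooperativeCancellationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cooperative-cancellation-sketch").setMaster("local[2]"))

    // Stand-in for an expensive per-record computation.
    def expensiveTransform(x: Int): Int = { Thread.sleep(10); x * 2 }

    val total = sc.parallelize(1 to 100000, 8).mapPartitions { iter =>
      val ctx = TaskContext.get()
      iter.map { record =>
        // Cooperative check: stop promptly once the Task has been marked as
        // killed, without depending on an interrupt hitting a blocking call.
        if (ctx.isInterrupted()) {
          throw new TaskKilledException
        }
        expensiveTransform(record)
      }
    }.count()

    println(total)
    sc.stop()
  }
}
{code}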
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]