Mark Hamstra created SPARK-17064:
------------------------------------

             Summary: Reconsider spark.job.interruptOnCancel
                 Key: SPARK-17064
                 URL: https://issues.apache.org/jira/browse/SPARK-17064
             Project: Spark
          Issue Type: Improvement
          Components: Scheduler, Spark Core
            Reporter: Mark Hamstra


There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill).  Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e., by default we do not interrupt Tasks already 
running on an Executor even when the Task has been canceled in the DAGScheduler, 
the Stage has been aborted, or the Job has been killed, etc.
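
For context, here is a minimal sketch (the group name, timings, and RDD are made up) of how the flag reaches a running Task today: setJobGroup records interruptOnCancel as the spark.job.interruptOnCancel local property for jobs submitted from that thread, and a later cancelJobGroup passes it down as interruptThread to Task#kill on the Executors.

{code:scala}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.{SparkConf, SparkContext}

object InterruptOnCancelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("interrupt-on-cancel-sketch").setMaster("local[2]"))

    // Submit a slow job asynchronously under a job group that opts in to thread
    // interruption.  setJobGroup is per-thread, so it is set on the thread that
    // actually submits the job.
    Future {
      sc.setJobGroup("slow-group", "cancellation demo", interruptOnCancel = true)
      sc.parallelize(1 to 100, 4).map { i => Thread.sleep(1000); i }.count()
    }

    Thread.sleep(3000)
    // With interruptOnCancel = true the running Task threads are interrupted; with
    // the default (false) the Tasks keep running on the Executors even though the
    // job has been cancelled in the DAGScheduler.
    sc.cancelJobGroup("slow-group")

    sc.stop()
  }
}
{code}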

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports, so that we can set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) that we still need 
protection similar to what the current default value of 
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?  
(One possibility is sketched after this list.)
* Once we have a safe mechanism to stop and clean up after already-executing 
Tasks, there is still the question of whether we _should_ end them.  While that 
is likely a good thing to do when individual Tasks are lightweight in terms of 
resource usage, in at least some cases not all running Tasks should be ended: 
https://github.com/apache/spark/pull/12436.  That means we probably need to keep 
Task interruption configurable at the Job or JobGroup level (and we need better 
documentation explaining how and when to allow interruption).
* There is one place in the current code (TaskSetManager#handleSuccessfulTask) 
that hard-codes interruptThread to "true".  This should be fixed, and similar 
misuses of killTask should be rejected in pull requests until this issue is 
adequately resolved.
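
On the "what should we do instead?" question, one possibility (just an illustrative sketch; the helper and RDD below are hypothetical, not existing code) is cooperative cancellation: Task#kill marks the TaskContext as interrupted even when interruptThread is false, so long-running user code can poll TaskContext.isInterrupted() and stop cleanly without anyone calling Thread.interrupt():

{code:scala}
import org.apache.spark.{SparkContext, TaskContext, TaskKilledException}
import org.apache.spark.rdd.RDD

object CooperativeCancellationSketch {
  // Hypothetical helper: does per-record work but checks the TaskContext's kill
  // flag between records.  Task#kill sets this flag even when interruptThread is
  // false, so the Task can end itself cleanly after a cancellation.
  def lengthsWithKillCheck(records: Iterator[String]): Iterator[Int] = {
    val ctx = TaskContext.get()
    records.map { record =>
      if (ctx.isInterrupted()) {
        // Throwing TaskKilledException lets the Executor report the task as
        // killed rather than failed.
        throw new TaskKilledException
      }
      record.length  // stand-in for real, possibly expensive, per-record work
    }
  }

  def run(sc: SparkContext, input: RDD[String]): Long =
    input.mapPartitions(lengthsWithKillCheck).count()
}
{code}

(Spark's InterruptibleIterator already performs a check like this around shuffle reads; the open question is whether and how to extend that kind of cooperative check so that arbitrary user closures can be stopped safely.)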



