[
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421503#comment-15421503
]
Mark Hamstra commented on SPARK-17064:
--------------------------------------
[~kayousterhout] [[email protected]] [~imranr]
> Reconsider spark.job.interruptOnCancel
> --------------------------------------
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.
> This has been recognized for a very long time (see, e.g., the ancient TODO
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've
> never had more than an incomplete solution. Killing running Tasks at the
> Executor level has been implemented by interrupting the threads running the
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill). Since
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
> addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting
> Task threads in this way has only been possible if interruptThread is true,
> and that typically comes from the setting of the interruptOnCancel property
> in the JobGroup, which in turn typically comes from the setting of
> spark.job.interruptOnCancel. Because of concerns over
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes
> being marked dead when a Task thread is interrupted, the default value of the
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already
> running on an Executor even when the Task has been canceled in the DAGScheduler,
> the Stage has been aborted, or the Job has been killed, etc.
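> For reference, interruptOnCancel is normally supplied per JobGroup through the
> public SparkContext API and then consulted when that group is cancelled. A
> minimal sketch (assuming sc is an existing SparkContext, as in spark-shell; the
> group id and the deliberately slow map are only illustrative):
> {code:scala}
> // Opt this thread's jobs into thread interruption when the group is cancelled.
> sc.setJobGroup("demo-group", "long-running demo job", interruptOnCancel = true)
>
> // Submit a slow job asynchronously so it can be cancelled from this thread.
> val pending = sc.parallelize(1 to 4, 4)
>   .map { i => Thread.sleep(60000); i }
>   .collectAsync()
>
> // Cancelling the group asks Executors to kill the running Tasks; whether their
> // threads are actually interrupted depends on the interruptOnCancel flag above.
> sc.cancelJobGroup("demo-group")
> {code}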
> There are several issues resulting from this current state of affairs, and
> they each probably need to spawn their own JIRA issue and PR once we decide
> on an overall strategy here. Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS
> versions that Spark now supports, so that we can set the default value of
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it
> still enough of an issue for non-HDFS usage (e.g. Cassandra) that we still
> need protection similar to what the current default value of
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> (One cooperative alternative is sketched after this list.)
> * Once we have a safe mechanism to stop and clean up after already executing
> Tasks, there is still the question of whether we _should_ end executing
> Tasks. While that is likely a good thing to do in cases where individual
> Tasks are lightweight in terms of resource usage, at least in some cases not
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436
> That means that we probably need to keep Task interruption configurable at the
> Job or JobGroup level (and we need better documentation explaining how and when
> to allow interruption or not).
> * There is one place in the current code
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to
> "true". This should be fixed, and similar misuses of killTask be denied in
> pull requests until this issue is adequately resolved.
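> Regarding what we might do instead of interrupting Task threads: the kill flag
> set by Task#kill is visible to user code via TaskContext.isInterrupted even when
> interruptThread is false, so long-running Tasks can already check for
> cancellation cooperatively. A minimal sketch (sc is assumed as above; records
> and expensiveTransform are placeholder names, not anything in the current code):
> {code:scala}
> import org.apache.spark.{TaskContext, TaskKilledException}
>
> val records = sc.parallelize(Seq("alpha", "beta", "gamma"))  // placeholder input
> def expensiveTransform(s: String): String = s.reverse        // stand-in for real work
>
> val transformed = records.mapPartitions { iter =>
>   val ctx = TaskContext.get()
>   iter.map { record =>
>     // Cooperative cancellation: Task#kill sets this flag regardless of
>     // interruptThread, so the Task can stop cleanly without a thread interrupt.
>     if (ctx.isInterrupted()) throw new TaskKilledException
>     expensiveTransform(record)
>   }
> }
> {code}
> Spark's own InterruptibleIterator already does a similar check for some internal
> iterators, so a safer default might lean on cooperative checks like this and only
> fall back to Thread.interrupt where it is known to be safe.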