[ 
https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311149#comment-15311149
 ] 

Brett Randall commented on SPARK-15685:
---------------------------------------

Thanks [~srowen].

The {{SecurityManager}} is only in play here for:

* Capturing the problem in a unit test and asserting that Spark should not 
exit the JVM when running in local mode.  In the unit test it is removed again 
and the pre-existing {{SecurityManager}} is restored in the test's finally 
block.
* As an unusual workaround for this problem, to prevent microservice server 
processes from being shut down by Spark as it handles an error.

There's no suggestion here that Spark should install a {{SecurityManager}}.  
That said, I don't expect one to have any performance impact - it only comes 
into play when some code attempts to call {{System.exit}}.  It shouldn't be 
needed for anything other than blocking callers of {{System.exit()}}, and it 
can delegate any other checks to a pre-existing {{SecurityManager}}.
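
For concreteness, here is a rough sketch of the workaround {{SecurityManager}} 
(the class name, stack-matching and delegation details are illustrative, not 
the exact test-case code):

{code}
import java.security.Permission

// Illustrative sketch: block System.exit() calls that originate from
// SparkUncaughtExceptionHandler and allow everything else, delegating to any
// pre-existing SecurityManager.
class BlockSparkExitSecurityManager(previous: SecurityManager) extends SecurityManager {

  override def checkExit(status: Int): Unit = {
    val fromSparkHandler = Thread.currentThread().getStackTrace.exists {
      frame => frame.getClassName.contains("SparkUncaughtExceptionHandler")
    }
    if (fromSparkHandler) {
      // Convert the would-be JVM shutdown into an exception the caller can handle.
      throw new SecurityException(s"Blocked System.exit($status) called by Spark")
    }
    if (previous != null) previous.checkExit(status)
  }

  override def checkPermission(perm: Permission): Unit = {
    // Permit everything else unless a pre-existing manager objects.
    if (previous != null) previous.checkPermission(perm)
  }
}

// Install it, keeping the previous manager so it can be restored later
// (e.g. in a test's finally block).
val previous = System.getSecurityManager
System.setSecurityManager(new BlockSparkExitSecurityManager(previous))
{code}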

The long-running application here is, for example, a microservice 
orchestrating calculations that are performed by Spark.  So the service is 
long-running, but the Spark jobs are short-running asynchronous tasks.  This 
seems a reasonable use of Spark to me: short batch jobs running embedded in 
the local JVM, orchestrated by a long-running service.  For larger jobs, the 
service switches to cluster mode and sends the Spark jobs off to a remote 
JVM, e.g. via YARN.
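
To make that concrete, the shape of the embedding is roughly this (the names 
and the trivial job are illustrative only, not the actual service code):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{ExecutionContext, Future}

object CalculationService {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Long-lived SparkContext with a local master, owned by the service JVM.
  private val sc = new SparkContext(
    new SparkConf().setAppName("calc-service").setMaster("local[4]"))

  // Each request runs a short batch job asynchronously; the service keeps running.
  def sumOfSquares(values: Seq[Int]): Future[Long] = Future {
    sc.parallelize(values).map(v => v.toLong * v).reduce(_ + _)
  }
}
{code}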

I don't think it is typical for a calculation framework or batch-job framework 
to decide to shut down the JVM with {{System.exit()}} rather than simply 
aborting the task thread and throwing some exception or notification up the 
stack for the caller to handle.

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15685
>                 URL: https://issues.apache.org/jira/browse/SPARK-15685
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Brett Randall
>            Priority: Critical
>
> Spark, when running in local mode, can encounter certain types of {{Error}} 
> exceptions in developer-code or third-party libraries and call 
> {{System.exit()}}, potentially killing a long-running JVM/service.  The 
> caller should decide on the exception-handling and whether the error should 
> be deemed fatal.
> *Consider this scenario:*
> * Spark is being used in local master mode within a long-running JVM 
> microservice, e.g. a Jetty instance.
> * A task is run.  The task errors with particular types of unchecked 
> throwables:
> ** a) there is some bad code and/or bad data that exposes a bug involving 
> unterminated recursion, leading to a {{StackOverflowError}}, or
> ** b) a rarely used function is called - owing to a packaging error in the 
> service, a third-party library is missing some dependencies and a 
> {{NoClassDefFoundError}} is thrown.
> *Expected behaviour:* Since we are running in local mode, we might expect 
> some unchecked exception to be thrown, to be optionally handled by the Spark 
> caller.  In the case of Jetty, a request thread or some other background 
> worker thread might handle the exception or not; the thread might exit or 
> note an error.  The caller should decide how the error is handled.
> *Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty 
> microservice is down and must be restarted.
> *Consequence:* Any local code or third-party library might cause Spark to 
> exit the long-running JVM/microservice, so Spark can be a problem in this 
> architecture.  I have seen this now on three separate occasions, leading to 
> service-down bug reports.
> *Analysis:*
> The line of code that seems to be the problem is: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405
> {code}
> // Don't forcibly exit unless the exception was inherently fatal, to avoid
> // stopping other tasks unnecessarily.
> if (Utils.isFatalError(t)) {
>     SparkUncaughtExceptionHandler.uncaughtException(t)
> }
> {code}
> [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] 
> first excludes anything matching Scala 
> [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31], 
> which matches everything except {{VirtualMachineError}}, {{ThreadDeath}}, 
> {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}.  
> {{Utils.isFatalError()}} further excludes {{InterruptedException}}, 
> {{NotImplementedError}} and {{ControlThrowable}}.
> Remaining are {{Error}}s such as {{StackOverflowError}} (a 
> {{VirtualMachineError}}) or {{NoClassDefFoundError}} (a {{LinkageError}}), 
> which occur in the aforementioned scenarios.  For these, 
> {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call 
> {{System.exit()}}.
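> A quick check with Scala's {{NonFatal}} extractor shows why these errors fall 
> through the filter (illustrative snippet, not Spark code):
> {code}
> import scala.util.control.NonFatal
> 
> // NonFatal does not match VirtualMachineError or LinkageError, so
> // Utils.isFatalError treats these as fatal and they reach the exit handler:
> NonFatal(new StackOverflowError())    // false -> deemed fatal
> NonFatal(new NoClassDefFoundError())  // false -> deemed fatal
> NonFatal(new RuntimeException())      // true  -> non-fatal, rethrown to the caller
> {code}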
> [Further 
> up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77]
>  in {{Executor}} we see that registration of 
> {{SparkUncaughtExceptionHandler}} is skipped in local mode:
> {code}
>   if (!isLocal) {
>     // Setup an uncaught exception handler for non-local mode.
>     // Make any thread terminations due to uncaught exceptions kill the entire
>     // executor process to avoid surprising stalls.
>     Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
>   }
> {code}
> The same exclusion must be applied to "fatal" errors in local mode - Spark 
> cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should 
> decide how to handle the error.
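> As a sketch, the kind of guard being requested would look something like this 
> (assuming {{isLocal}} is visible at that point in {{Executor}}; the actual fix 
> may differ):
> {code}
> // Only escalate to the process-killing handler outside local mode; in local
> // mode, report the error and let the embedding application decide what to do.
> if (!isLocal && Utils.isFatalError(t)) {
>     SparkUncaughtExceptionHandler.uncaughtException(t)
> }
> {code}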
> A minimal test-case is supplied.  It installs a logging {{SecurityManager}} 
> to confirm that {{System.exit()}} was called from 
> {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}.  It 
> also hints at the workaround - install your own {{SecurityManager}} and 
> inspect the current stack in {{checkExit()}} to prevent Spark from exiting 
> the JVM.
> Test-case: https://github.com/javabrett/SPARK-15685 .


