Andrew Montalenti created STORM-204:
---------------------------------------

             Summary: Local topologies that throw exceptions can crash the REPL
                 Key: STORM-204
                 URL: https://issues.apache.org/jira/browse/STORM-204
             Project: Apache Storm (Incubating)
          Issue Type: Bug
         Environment: $ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
            Reporter: Andrew Montalenti
            Priority: Minor


We've been testing some Storm topologies using "lein repl". Based on the 
storm-starter project, our local testing environment offers a Clojure function 
(run-local!), which builds up the topology using the Clojure DSL and runs it. 
The issue is that there are conditions that can cause the topology to fail; the 
exception thrown then crashes the Storm worker, which, in turn, crashes the 
lein repl itself. This is frustrating since part of the purpose of running a 
local cluster is to test it from something like a REPL.

I think I've narrowed down what's going wrong. You can see an example session 
with stacktrace here:

https://gist.github.com/amontalenti/8677464#file-example_storm_crashing_lein-txt-L142-L173

The way I created this error is by renaming my Python module before running the 
topology, so that the Python file could not be found.

I think what's going on is that the ShellBolt is throwing a RuntimeException, 
which is uncaught by whatever is running the ShellBolt. This, in turn, crashes 
the worker: "Error when launching multilang subprocess". The executor notices 
that the worker has died, but is overzealous and decides to call (halt-process!) 
on it. Under the hood, halt-process! uses Runtime.getRuntime().halt(), which is a 
forcible kill of the running JVM. Since the JVM, in this case, is "lein repl", 
I think this is what ultimately kills the REPL.

http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#halt(int)
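
To make the suspected chain concrete, here is a minimal Java sketch of it: an 
exception escapes a worker thread, a handler notices, and the handler runs a 
"death" action. This is not Storm's code. HaltChainSketch and runWorker are 
hypothetical names, and the death action is passed in as a Runnable so the 
sketch can run without killing anything. In the scenario I describe above, 
that action would effectively be Runtime.getRuntime().halt(status).

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: models "uncaught exception -> handler -> halt".
// HaltChainSketch and runWorker are hypothetical names, not Storm APIs.
class HaltChainSketch {

    /**
     * Runs `work` on a fresh thread and returns the message of any exception
     * that escaped it, or null if the thread finished normally. `onDeath` is
     * invoked only when an exception escapes the thread.
     */
    static String runWorker(Runnable work, Runnable onDeath) {
        AtomicReference<String> escaped = new AtomicReference<>();
        Thread t = new Thread(work, "sketch-worker");
        // The handler runs on the dying thread, before the thread terminates.
        t.setUncaughtExceptionHandler((thread, err) -> {
            escaped.set(err.getMessage());
            onDeath.run();
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return escaped.get();
    }

    public static void main(String[] args) {
        String msg = runWorker(
            () -> { throw new RuntimeException("Error when launching multilang subprocess"); },
            // Replacing this with () -> Runtime.getRuntime().halt(1) terminates
            // the whole JVM, including whatever is hosting it (such as a REPL),
            // without running shutdown hooks.
            () -> System.out.println("halt would happen here"));
        System.out.println("escaped: " + msg);
    }
}
```

The point of the sketch is that the death action runs inside the same JVM as 
the caller, so a forcible halt there takes the caller down with it.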

I suppose there are two possible fixes here. One is to make the Storm worker a 
little more resilient to a misconfigured ShellBolt. The other is to make the 
halt-process! call not run when a REPL environment is detected.
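
As a rough illustration of the second option, the halt could be routed 
through a small guard that throws instead of halting when the process is 
embedded in something like a REPL. Everything in this Java sketch is an 
assumption made for illustration: GuardedHalt, EmbeddedHaltException, and the 
"sketch.embedded" system property are invented names, and how Storm would 
actually detect a REPL is left open.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch, not Storm code. All names here are invented.
class GuardedHalt {

    /** Thrown in place of a hard halt when running embedded (e.g. in a REPL). */
    static class EmbeddedHaltException extends RuntimeException {
        final int status;

        EmbeddedHaltException(int status, String reason) {
            super("halt requested (status " + status + "): " + reason);
            this.status = status;
        }
    }

    private final boolean embedded;
    private final IntConsumer hardHalt;

    GuardedHalt(boolean embedded, IntConsumer hardHalt) {
        this.embedded = embedded;
        this.hardHalt = hardHalt;
    }

    /**
     * Production-style wiring: reads the hypothetical "sketch.embedded"
     * system property and uses the real Runtime.halt for the hard path.
     */
    static GuardedHalt fromSystemProperties() {
        return new GuardedHalt(
            Boolean.getBoolean("sketch.embedded"),
            status -> Runtime.getRuntime().halt(status));
    }

    /** Halts the JVM, unless embedded, in which case the caller gets an exception. */
    void haltProcess(int status, String reason) {
        if (embedded) {
            throw new EmbeddedHaltException(status, reason);
        }
        hardHalt.accept(status);
    }
}
```

With a guard like this, a REPL session started with -Dsketch.embedded=true 
would see an exception it can print and recover from, while a standalone 
worker would still be halted as before.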

I am glad to work on either of these fixes once the issue is confirmed and a 
path forward is suggested.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)