Michael Noll created STORM-695:
----------------------------------
Summary: storm CLI tool reports zero exit code on error scenario,
take 2
Key: STORM-695
URL: https://issues.apache.org/jira/browse/STORM-695
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.9.3
Reporter: Michael Noll
Priority: Minor
Commands such as "storm kill non-existing-topology" will return an exit code of
zero, indicating success when in fact the command failed.
h3. How to reproduce
Here is but one example where the {{storm}} CLI tool violates shell best
practices:
{code}
# Let's kill a topology that is in fact not running in the cluster.
$ storm kill i-do-not-exist-topo
<snip>
Exception in thread "main" NotAliveException(msg:i-do-not-exist-topo is not
alive)
# Print the exit code of last command.
$ echo $?
0 # <<< but since the kill command failed this should be non-zero!
{code}
Another example is the "storm jar" command. If you attempt to submit a
topology that has the same name as an existing, running topology, the "storm
jar" command will not submit the topology -- instead it will print an exception
(think: "the topology FooName is already running"), which is ok, but it will
then exit with a return code of zero, which indicates success (which is wrong).
h3. Impact
This bug prevents automated deployment tools such as Ansible or Puppet as well
as ad-hoc CLI scripting (think: fire-fighting Ops teams) to work properly
because Storm violates shell conventions by not returning non-zero exit codes
in case of failures.
h3. How to fix
>From what I understand the solution is two-fold:
# The various Storm commands that are being called by the {{storm}} script must
return proper exit codes.
# The {{storm}} script must store these exit codes and return itself with the
respective exit code of the Storm command it actually ran.
For example, here's the current code that implements the "storm kill" command:
{code}
# In: bin/storm
def kill(*args):
"""Syntax: [storm kill topology-name [-w wait-time-secs]]
Kills the topology with the name topology-name. Storm will
first deactivate the topology's spouts for the duration of
the topology's message timeout to allow all messages currently
being processed to finish processing. Storm will then shutdown
the workers and clean up their state. You can override the length
of time Storm waits between deactivation and shutdown with the -w flag.
"""
exec_storm_class(
"backtype.storm.command.kill_topology",
args=args,
jvmtype="-client",
extrajars=[USER_CONF_DIR, STORM_BIN_DIR])
{code}
which in turn calls the following code in {{kill_topology.clj}}:
{code}
;; In: backtype.storm.command.kill-topology
(ns backtype.storm.command.kill-topology
(:use [clojure.tools.cli :only [cli]])
(:use [backtype.storm thrift config log])
(:import [backtype.storm.generated KillOptions])
(:gen-class))
(defn -main [& args]
(let [[{wait :wait} [name] _] (cli args ["-w" "--wait" :default nil :parse-fn
#(Integer/parseInt %)])
opts (KillOptions.)]
(if wait (.set_wait_secs opts wait))
(with-configured-nimbus-connection nimbus
(.killTopologyWithOpts nimbus name opts)
(log-message "Killed topology: " name)
)))
{code}
which in turn calls the following code in {{nimbus.clj}}:
{code}
;; In: backtype.storm.daemon.nimbus
(^void killTopologyWithOpts [this ^String storm-name ^KillOptions options]
(check-storm-active! nimbus storm-name true)
(let [topology-conf (try-read-storm-conf-from-name conf storm-name
nimbus)]
(check-authorization! nimbus storm-name topology-conf "killTopology"))
(let [wait-amt (if (.is_set_wait_secs options)
(.get_wait_secs options)
)]
(transition-name! nimbus storm-name [:kill wait-amt] true)
))
{code}
As you can see the current implementation does not pass success/failure
information back to the caller.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)