Michael Noll created STORM-695:
----------------------------------

             Summary: storm CLI tool reports zero exit code on error scenario, 
take 2
                 Key: STORM-695
                 URL: https://issues.apache.org/jira/browse/STORM-695
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.9.3
            Reporter: Michael Noll
            Priority: Minor


Commands such as "storm kill non-existing-topology" will return an exit code of 
zero, indicating success when in fact the command failed.

h3. How to reproduce

Here is but one example where the {{storm}} CLI tool violates shell best 
practices:

{code}
# Let's kill a topology that is in fact not running in the cluster.
$ storm kill i-do-not-exist-topo
<snip>
Exception in thread "main" NotAliveException(msg:i-do-not-exist-topo is not 
alive)

# Print the exit code of last command.
$ echo $?
0  # <<< but since the kill command failed this should be non-zero!
{code}

Another example is the "storm jar" command.  If you attempt to submit a 
topology that has the same name as an existing, running topology, the "storm 
jar" command will not submit the topology -- instead it will print an exception 
(think: "the topology FooName is already running"), which is ok, but it will 
then exit with a return code of zero, which indicates success (which is wrong).

h3. Impact

This bug prevents automated deployment tools such as Ansible or Puppet as well 
as ad-hoc CLI scripting (think: fire-fighting Ops teams) to work properly 
because Storm violates shell conventions by not returning non-zero exit codes 
in case of failures.

h3. How to fix

>From what I understand the solution is two-fold:

# The various Storm commands that are being called by the {{storm}} script must 
return proper exit codes.
# The {{storm}} script must store these exit codes and return itself with the 
respective exit code of the Storm command it actually ran.

For example, here's the current code that implements the "storm kill" command:

{code}
# In: bin/storm
def kill(*args):
    """Syntax: [storm kill topology-name [-w wait-time-secs]]

    Kills the topology with the name topology-name. Storm will 
    first deactivate the topology's spouts for the duration of 
    the topology's message timeout to allow all messages currently 
    being processed to finish processing. Storm will then shutdown 
    the workers and clean up their state. You can override the length 
    of time Storm waits between deactivation and shutdown with the -w flag.
    """
    exec_storm_class(
        "backtype.storm.command.kill_topology", 
        args=args, 
        jvmtype="-client", 
        extrajars=[USER_CONF_DIR, STORM_BIN_DIR])
{code}

which in turn calls the following code in {{kill_topology.clj}}:

{code}
;; In: backtype.storm.command.kill-topology
(ns backtype.storm.command.kill-topology
  (:use [clojure.tools.cli :only [cli]])
  (:use [backtype.storm thrift config log])
  (:import [backtype.storm.generated KillOptions])
  (:gen-class))

(defn -main [& args]
  (let [[{wait :wait} [name] _] (cli args ["-w" "--wait" :default nil :parse-fn 
#(Integer/parseInt %)])
        opts (KillOptions.)]
    (if wait (.set_wait_secs opts wait))
    (with-configured-nimbus-connection nimbus
      (.killTopologyWithOpts nimbus name opts)
      (log-message "Killed topology: " name)
      )))
{code}

which in turn calls the following code in {{nimbus.clj}}:

{code}
;; In: backtype.storm.daemon.nimbus
      (^void killTopologyWithOpts [this ^String storm-name ^KillOptions options]
        (check-storm-active! nimbus storm-name true)
        (let [topology-conf (try-read-storm-conf-from-name conf storm-name 
nimbus)]
          (check-authorization! nimbus storm-name topology-conf "killTopology"))
        (let [wait-amt (if (.is_set_wait_secs options)
                         (.get_wait_secs options)                         
                         )]
          (transition-name! nimbus storm-name [:kill wait-amt] true)
          ))
{code}

As you can see the current implementation does not pass success/failure 
information back to the caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to