[ 
https://issues.apache.org/jira/browse/STORM-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340333#comment-14340333
 ] 

Michael Noll commented on STORM-695:
------------------------------------

To fix the storm CLI script:  Essentially we must updated the 
{{exec_storm_class()}} function to understand and pass exit codes of the 
commands it ran back to the caller.

Idea:

{code}
$ diff -u storm storm.fixed
--- storm       2015-02-24 13:59:03.000000000 +0000
+++ storm.fixed 2015-02-27 15:47:38.373536529 +0000
@@ -179,11 +179,13 @@
         "-cp", get_classpath(extrajars),
     ] + jvmopts + [klass] + list(args)
     print("Running: " + " ".join(all_args))
+    return_code = 1 # default return code indicates failure
     if fork:
-        os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
+        return_code = os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
     else:
         # handling whitespaces in JAVA_CMD
-        sub.call(all_args)
+        return_code = sub.call(all_args)
+    return return_code

 def jar(jarfile, klass, *args):
     """Syntax: [storm jar topology-jar-path class ...]
@@ -490,7 +492,7 @@
 def unknown_command(*args):
     print("Unknown command: [storm %s]" % ' '.join(sys.argv[1:]))
     print_usage()
-    sys.exit(254)
+    return 254

 COMMANDS = {"jar": jar, "kill": kill, "shell": shell, "nimbus": nimbus, "ui": 
ui, "logviewer": logviewer,
             "drpc": drpc, "supervisor": supervisor, "localconfvalue": 
print_localconfvalue,
@@ -532,7 +534,8 @@
     parse_config(config_list)
     COMMAND = args[0]
     ARGS = args[1:]
-    (COMMANDS.get(COMMAND, unknown_command))(*ARGS)
+    return_code = (COMMANDS.get(COMMAND, unknown_command))(*ARGS)
+    sys.exit(return_code)

 if __name__ == "__main__":
     main()
{code}

As I said above this diff/patch alone is not sufficient because the Storm 
commands themselves (e.g. kill_topology.clj) are not returning proper 
success/failure information.

> storm CLI tool reports zero exit code on error scenario, take 2
> ---------------------------------------------------------------
>
>                 Key: STORM-695
>                 URL: https://issues.apache.org/jira/browse/STORM-695
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Michael Noll
>            Priority: Minor
>
> Commands such as "storm kill non-existing-topology" will return an exit code 
> of zero, indicating success when in fact the command failed.
> h3. How to reproduce
> Here is but one example where the {{storm}} CLI tool violates shell best 
> practices:
> {code}
> # Let's kill a topology that is in fact not running in the cluster.
> $ storm kill i-do-not-exist-topo
> <snip>
> Exception in thread "main" NotAliveException(msg:i-do-not-exist-topo is not 
> alive)
> # Print the exit code of last command.
> $ echo $?
> 0  # <<< but since the kill command failed this should be non-zero!
> {code}
> Another example is the "storm jar" command.  If you attempt to submit a 
> topology that has the same name as an existing, running topology, the "storm 
> jar" command will not submit the topology -- instead it will print an 
> exception (think: "the topology FooName is already running"), which is ok, 
> but it will then exit with a return code of zero, which indicates success 
> (which is wrong).
> h3. Impact
> This bug prevents automated deployment tools such as Ansible or Puppet as 
> well as ad-hoc CLI scripting (think: fire-fighting Ops teams) to work 
> properly because Storm violates shell conventions by not returning non-zero 
> exit codes in case of failures.
> h3. How to fix
> From what I understand the solution is two-fold:
> # The various Storm commands that are being called by the {{storm}} script 
> must return proper exit codes.
> # The {{storm}} script must store these exit codes and return itself with the 
> respective exit code of the Storm command it actually ran.
> For example, here's the current code that implements the "storm kill" command:
> {code}
> # In: bin/storm
> def kill(*args):
>     """Syntax: [storm kill topology-name [-w wait-time-secs]]
>     Kills the topology with the name topology-name. Storm will 
>     first deactivate the topology's spouts for the duration of 
>     the topology's message timeout to allow all messages currently 
>     being processed to finish processing. Storm will then shutdown 
>     the workers and clean up their state. You can override the length 
>     of time Storm waits between deactivation and shutdown with the -w flag.
>     """
>     exec_storm_class(
>         "backtype.storm.command.kill_topology", 
>         args=args, 
>         jvmtype="-client", 
>         extrajars=[USER_CONF_DIR, STORM_BIN_DIR])
> {code}
> which in turn calls the following code in {{kill_topology.clj}}:
> {code}
> ;; In: backtype.storm.command.kill-topology
> (ns backtype.storm.command.kill-topology
>   (:use [clojure.tools.cli :only [cli]])
>   (:use [backtype.storm thrift config log])
>   (:import [backtype.storm.generated KillOptions])
>   (:gen-class))
> (defn -main [& args]
>   (let [[{wait :wait} [name] _] (cli args ["-w" "--wait" :default nil 
> :parse-fn #(Integer/parseInt %)])
>         opts (KillOptions.)]
>     (if wait (.set_wait_secs opts wait))
>     (with-configured-nimbus-connection nimbus
>       (.killTopologyWithOpts nimbus name opts)
>       (log-message "Killed topology: " name)
>       )))
> {code}
> which in turn calls the following code in {{nimbus.clj}}:
> {code}
> ;; In: backtype.storm.daemon.nimbus
>       (^void killTopologyWithOpts [this ^String storm-name ^KillOptions 
> options]
>         (check-storm-active! nimbus storm-name true)
>         (let [topology-conf (try-read-storm-conf-from-name conf storm-name 
> nimbus)]
>           (check-authorization! nimbus storm-name topology-conf 
> "killTopology"))
>         (let [wait-amt (if (.is_set_wait_secs options)
>                          (.get_wait_secs options)                         
>                          )]
>           (transition-name! nimbus storm-name [:kill wait-amt] true)
>           ))
> {code}
> As you can see the current implementation does not pass success/failure 
> information back to the caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to