[ https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654519#comment-15654519 ]
Craig Hawco commented on STORM-2194: ------------------------------------ Will do, however after looking through STORM-772 and STORM-773, it's not entirely clear what the intent was. Basically, my understand is that the exception passed here is the exception from the worker -- and {{exception-cause?}} walks the entire exception chain checking for {{InterruptedException}} or {{InterruptedIOException}} -- which may have been due to a bolt talking to an external service (e.g. {{SocketTimeoutException}} is an {{InterruptedIOException}}). So, some options: # My understanding is incorrect, and this error is NOT from the bolt itself # {{exception-cause?}} doesn't walk the entire exception tree as I thought # This behaviour is mostly intentional, but should only be applied if error itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if those appear anywhere in the chain I think #3 above isn't actually case: {code} user=> (def ex (new RuntimeException (new InterruptedException))) #'user/ex (defn exception-cause? [klass ^Throwable t] (->> (iterate #(.getCause ^Throwable %) t) (take-while identity) (some (partial instance? klass)) boolean)) (->> (iterate #(.getCause ^Throwable %) t) (take-while identity) (some (partial instance? klass)) boolean)) (take-while identity) (some (partial instance? klass)) boolean)) #'user/exception-cause? user=> (exception-cause? InterruptedException ex) true user=> {code} So, either the exception isn't the one raised by the bolt (I'll look for some evidence to support this next), the check should only be ignoring errors that are themselves {{InterruptedException}}/{{InterruptedIOException}} but not looking elsewhere in the chain, or this was intended as just a logging line, and {{sucide-fn}} should still get invoked after logging. > ReportErrorAndDie doesn't always die > ------------------------------------ > > Key: STORM-2194 > URL: https://issues.apache.org/jira/browse/STORM-2194 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 2.0.0, 1.0.2 > Reporter: Craig Hawco > Time Spent: 20m > Remaining Estimate: 0h > > I've been trying to track down a cause of some of our issues with some > exceptions leaving Storm workers in a zombified state for some time. I > believe I've isolated the bug to the behaviour in > :report-error-and-die/reportErrorAndDie in the executor. Essentially: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...") > ((:suicide-fn <>)))) > {code} > has the grouping for the if statement slightly wrong. It shouldn't log OR die > from InterruptedException/InterruptedIOException, but it should log under > that condition, and ALWAYS die. > Basically: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...")) > ((:suicide-fn <>))) > {code} > After digging into the Java port of this code, it looks like a different bug > was introduced while porting: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > suicideFn.run(); > } > {code} > Was how this was initially ported, and STORM-2142 changed this to: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } else { > suicideFn.run(); > } > {code} > However, I believe the correct port is as described above: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } > suicideFn.run(); > {code} > I'll look into providing patches for the 1.x and 2.x branches shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)