[ https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654599#comment-15654599 ]
Craig Hawco commented on STORM-2194: ------------------------------------ It does appear to me that the executor's {{report-error-and-die}} is the fn passed to the bolt creation multimethod {{mk-threads :bolt}}, via {{executor-data}}, and this is what gets invoked in the {{async-loop}} for the outer catch clause against {{Throwable}} if the exception is an {{InterruptedIOException}}, which then applies the above logic. I don't see how this behaviour could possibly be correct now, unless there's some other mechanism for causing workers to exit -- any bolt that throws an unhandled exception that has {{InterruptedIOException}} in the chain will invoke {{:report-error-and-die}}, which then happily logs and does nothing, leaving the process otherwise running. I'm fairly certain there is at least SOME bug here, so now the question is: # Do we just log and always suicide? STORM-772 and STORM-773 indicate that we DID actually want to ignore SOME of these exceptions. # Do we just log and ignore if and only if {{error}} is ITSELF an {{InterruptedException}}/{{InterruptedIOException}}? This now seems to be the most likely intent given the discussion in those two other tickets. # There's some other mechanism for making workers exit that I'm not aware -- entirely possible, but I'm not sure how I'd go about finding it if that's the case. > ReportErrorAndDie doesn't always die > ------------------------------------ > > Key: STORM-2194 > URL: https://issues.apache.org/jira/browse/STORM-2194 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 2.0.0, 1.0.2 > Reporter: Craig Hawco > Time Spent: 20m > Remaining Estimate: 0h > > I've been trying to track down a cause of some of our issues with some > exceptions leaving Storm workers in a zombified state for some time. I > believe I've isolated the bug to the behaviour in > :report-error-and-die/reportErrorAndDie in the executor. Essentially: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...") > ((:suicide-fn <>)))) > {code} > has the grouping for the if statement slightly wrong. It shouldn't log OR die > from InterruptedException/InterruptedIOException, but it should log under > that condition, and ALWAYS die. > Basically: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...")) > ((:suicide-fn <>))) > {code} > After digging into the Java port of this code, it looks like a different bug > was introduced while porting: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > suicideFn.run(); > } > {code} > Was how this was initially ported, and STORM-2142 changed this to: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } else { > suicideFn.run(); > } > {code} > However, I believe the correct port is as described above: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } > suicideFn.run(); > {code} > I'll look into providing patches for the 1.x and 2.x branches shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)