[ 
https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654599#comment-15654599
 ] 

Craig Hawco commented on STORM-2194:
------------------------------------

It does appear to me that the executor's {{report-error-and-die}} is the fn 
passed to the bolt creation multimethod {{mk-threads :bolt}}, via 
{{executor-data}}, and this is what gets invoked in the {{async-loop}} for the 
outer catch clause against {{Throwable}} if the exception is an 
{{InterruptedIOException}}, which then applies the above logic. 

I don't see how this behaviour could possibly be correct now, unless there's 
some other mechanism for causing workers to exit -- any bolt that throws an 
unhandled exception that has {{InterruptedIOException}} in the chain will 
invoke {{:report-error-and-die}}, which then happily logs and does nothing, 
leaving the process otherwise running.

I'm fairly certain there is at least SOME bug here, so now the question is:

# Do we just log and always suicide? STORM-772 and STORM-773 indicate that we 
DID actually want to ignore SOME of these exceptions.
# Do we just log and ignore if and only if {{error}} is ITSELF an 
{{InterruptedException}}/{{InterruptedIOException}}? This now seems to be the 
most likely intent given the discussion in those two other tickets.
# There's some other mechanism for making workers exit that I'm not aware -- 
entirely possible, but I'm not sure how I'd go about finding it if that's the 
case.

> ReportErrorAndDie doesn't always die
> ------------------------------------
>
>                 Key: STORM-2194
>                 URL: https://issues.apache.org/jira/browse/STORM-2194
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 2.0.0, 1.0.2
>            Reporter: Craig Hawco
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've been trying to track down a cause of some of our issues with some 
> exceptions leaving Storm workers in a zombified state for some time. I 
> believe I've isolated the bug to the behaviour in 
> :report-error-and-die/reportErrorAndDie in the executor. Essentially:
> {code}
>      :report-error-and-die (fn [error]
>                              (try
>                                ((:report-error <>) error)
>                                (catch Exception e
>                                  (log-message "Error while reporting error to 
> cluster, proceeding with shutdown")))
>                              (if (or
>                                     (exception-cause? InterruptedException 
> error)
>                                     (exception-cause? 
> java.io.InterruptedIOException error))
>                                (log-message "Got interrupted excpetion 
> shutting thread down...")
>                                ((:suicide-fn <>))))
> {code}
> has the grouping for the if statement slightly wrong. It shouldn't log OR die 
> from InterruptedException/InterruptedIOException, but it should log under 
> that condition, and ALWAYS die. 
> Basically:
> {code}
>      :report-error-and-die (fn [error]
>                              (try
>                                ((:report-error <>) error)
>                                (catch Exception e
>                                  (log-message "Error while reporting error to 
> cluster, proceeding with shutdown")))
>                              (if (or
>                                     (exception-cause? InterruptedException 
> error)
>                                     (exception-cause? 
> java.io.InterruptedIOException error))
>                                (log-message "Got interrupted excpetion 
> shutting thread down..."))
>                              ((:suicide-fn <>)))
> {code}
> After digging into the Java port of this code, it looks like a different bug 
> was introduced while porting:
> {code}
>         if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>                 || 
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>             LOG.info("Got interrupted exception shutting thread down...");
>             suicideFn.run();
>         }
> {code}
> Was how this was initially ported, and STORM-2142 changed this to:
> {code}
>         if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>                 || 
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>             LOG.info("Got interrupted exception shutting thread down...");
>         } else {
>             suicideFn.run();
>         }
> {code}
> However, I believe the correct port is as described above:
> {code}
>         if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>                 || 
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>             LOG.info("Got interrupted exception shutting thread down...");
>         }
>         suicideFn.run();
> {code}
> I'll look into providing patches for the 1.x and 2.x branches shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to