[
https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654519#comment-15654519
]
Craig Hawco edited comment on STORM-2194 at 11/10/16 4:49 PM:
--------------------------------------------------------------
Will do, however after looking through STORM-772 and STORM-773, it's not
entirely clear what the intent was.
Basically, my understand is that the exception passed here is the exception
from the worker -- and {{exception-cause?}} walks the entire exception chain
checking for {{InterruptedException}} or {{InterruptedIOException}} -- which
may have been due to a bolt talking to an external service (e.g.
{{SocketTimeoutException}} is an {{InterruptedIOException}}).
So, some options:
# My understanding is incorrect, and this error is NOT from the bolt itself
# {{exception-cause?}} doesn't walk the entire exception tree as I thought
# This behaviour is mostly intentional, but should only be applied if error
itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if
those appear anywhere in the chain
I think #2 above is the case:
{code}
user=> (def ex (new RuntimeException (new InterruptedException)))
#'user/ex
(defn exception-cause?
[klass ^Throwable t]
(->> (iterate #(.getCause ^Throwable %) t)
(take-while identity)
(some (partial instance? klass))
boolean))
(->> (iterate #(.getCause ^Throwable %) t)
(take-while identity)
(some (partial instance? klass))
boolean))
(take-while identity)
(some (partial instance? klass))
boolean))
#'user/exception-cause?
user=> (exception-cause? InterruptedException ex)
true
user=>
{code}
So, either the exception isn't the one raised by the bolt (I'll look for some
evidence to support this next), the check should only be ignoring errors that
are themselves {{InterruptedException}}/{{InterruptedIOException}} but not
looking elsewhere in the chain, or this was intended as just a logging line,
and {{sucide-fn}} should still get invoked after logging.
was (Author: chawco):
Will do, however after looking through STORM-772 and STORM-773, it's not
entirely clear what the intent was.
Basically, my understand is that the exception passed here is the exception
from the worker -- and {{exception-cause?}} walks the entire exception chain
checking for {{InterruptedException}} or {{InterruptedIOException}} -- which
may have been due to a bolt talking to an external service (e.g.
{{SocketTimeoutException}} is an {{InterruptedIOException}}).
So, some options:
# My understanding is incorrect, and this error is NOT from the bolt itself
# {{exception-cause?}} doesn't walk the entire exception tree as I thought
# This behaviour is mostly intentional, but should only be applied if error
itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if
those appear anywhere in the chain
I think #3 above isn't actually case:
{code}
user=> (def ex (new RuntimeException (new InterruptedException)))
#'user/ex
(defn exception-cause?
[klass ^Throwable t]
(->> (iterate #(.getCause ^Throwable %) t)
(take-while identity)
(some (partial instance? klass))
boolean))
(->> (iterate #(.getCause ^Throwable %) t)
(take-while identity)
(some (partial instance? klass))
boolean))
(take-while identity)
(some (partial instance? klass))
boolean))
#'user/exception-cause?
user=> (exception-cause? InterruptedException ex)
true
user=>
{code}
So, either the exception isn't the one raised by the bolt (I'll look for some
evidence to support this next), the check should only be ignoring errors that
are themselves {{InterruptedException}}/{{InterruptedIOException}} but not
looking elsewhere in the chain, or this was intended as just a logging line,
and {{sucide-fn}} should still get invoked after logging.
> ReportErrorAndDie doesn't always die
> ------------------------------------
>
> Key: STORM-2194
> URL: https://issues.apache.org/jira/browse/STORM-2194
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 2.0.0, 1.0.2
> Reporter: Craig Hawco
> Time Spent: 20m
> Remaining Estimate: 0h
>
> I've been trying to track down a cause of some of our issues with some
> exceptions leaving Storm workers in a zombified state for some time. I
> believe I've isolated the bug to the behaviour in
> :report-error-and-die/reportErrorAndDie in the executor. Essentially:
> {code}
> :report-error-and-die (fn [error]
> (try
> ((:report-error <>) error)
> (catch Exception e
> (log-message "Error while reporting error to
> cluster, proceeding with shutdown")))
> (if (or
> (exception-cause? InterruptedException
> error)
> (exception-cause?
> java.io.InterruptedIOException error))
> (log-message "Got interrupted excpetion
> shutting thread down...")
> ((:suicide-fn <>))))
> {code}
> has the grouping for the if statement slightly wrong. It shouldn't log OR die
> from InterruptedException/InterruptedIOException, but it should log under
> that condition, and ALWAYS die.
> Basically:
> {code}
> :report-error-and-die (fn [error]
> (try
> ((:report-error <>) error)
> (catch Exception e
> (log-message "Error while reporting error to
> cluster, proceeding with shutdown")))
> (if (or
> (exception-cause? InterruptedException
> error)
> (exception-cause?
> java.io.InterruptedIOException error))
> (log-message "Got interrupted excpetion
> shutting thread down..."))
> ((:suicide-fn <>)))
> {code}
> After digging into the Java port of this code, it looks like a different bug
> was introduced while porting:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
> ||
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
> LOG.info("Got interrupted exception shutting thread down...");
> suicideFn.run();
> }
> {code}
> Was how this was initially ported, and STORM-2142 changed this to:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
> ||
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
> LOG.info("Got interrupted exception shutting thread down...");
> } else {
> suicideFn.run();
> }
> {code}
> However, I believe the correct port is as described above:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
> ||
> Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
> LOG.info("Got interrupted exception shutting thread down...");
> }
> suicideFn.run();
> {code}
> I'll look into providing patches for the 1.x and 2.x branches shortly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)