[
https://issues.apache.org/jira/browse/STORM-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Scott Moore updated STORM-4104:
-------------------------------
External issue URL: https://github.com/apache/storm/pull/3739
> Pacemaker server stability issues - e.g. shuts down when topology killed
> ------------------------------------------------------------------------
>
> Key: STORM-4104
> URL: https://issues.apache.org/jira/browse/STORM-4104
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Scott Moore
> Assignee: Scott Moore
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> StormServerHandler, used by the Pacemaker Server (and by the Netty Server in
> each Worker), is fragile when handling certain Exceptions derived from
> IOException. In Storm1 the same handler would ignore Exceptions and only
> terminate for serious JVM errors such as OutOfMemoryError.
> In Storm2 the handler does something similar but, instead of ignoring all
> 'regular' Exceptions, it has a set of ALLOWED_EXCEPTIONS that can be ignored,
> and this set currently contains just IOException.
> As the code currently stands, only an exception that is exactly IOException
> is ignored. All other exceptions, including subclasses of IOException, cause
> the runtime to terminate after logging
> "Received error in netty thread.. terminating server..."
> When a connection from a worker to the Pacemaker Server terminates, whether
> expected (e.g. killing a topology) or unexpected (e.g. a node in the cluster
> rebooting), Pacemaker Server is likely to see a SocketException. This causes
> it to terminate.
> As SocketException is derived from IOException, I would say a more robust
> way for Pacemaker Server to handle this, and to achieve stability similar to
> that seen with Storm1, is to 'swallow' not only IOException itself but also
> any exception derived from IOException (which of course includes
> SocketException).
> Modifying handleUncaughtException to make use of
> Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of
> Pacemaker. As StormServerHandler is also used in the Worker's Netty Server,
> the Workers would become more resilient to networking exceptions too. For
> example, a Worker receiving a transfer from a remote that reboots should no
> longer restart; we do sometimes see a cascade of worker restarts under such
> scenarios.
> I have modified a build with such a change and can indeed see greater
> stability from Pacemaker Server.
> I will have a pull request for the changes I have made linked to this issue
> soon.
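
To illustrate the difference described in the issue, here is a minimal, self-contained Java sketch. It is not Storm's code: the class name AllowedExceptionSketch, the ALLOWED set, and both helper methods are hypothetical stand-ins. It assumes the current check compares the exception's exact class against the allow-list, and it contrasts that with a check that walks the cause chain and accepts subclasses, which is the behaviour the description attributes to Utils.exceptionCauseIsInstanceOf.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Set;

public class AllowedExceptionSketch {
    // Hypothetical stand-in for the handler's allow-list; not copied from Storm.
    static final Set<Class<?>> ALLOWED = Set.of(IOException.class);

    // Exact-class check: matches only when the thrown object's own class is in the set.
    static boolean allowedByExactClass(Throwable t) {
        return ALLOWED.contains(t.getClass());
    }

    // Subclass-aware check: walks the cause chain and accepts any instance
    // (including subclasses) of an allowed class.
    static boolean allowedByInstanceOf(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            for (Class<?> allowed : ALLOWED) {
                if (allowed.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        Throwable wrapped = new RuntimeException(new SocketException("Broken pipe"));

        System.out.println(allowedByExactClass(new IOException("plain"))); // true
        System.out.println(allowedByExactClass(reset));                    // false
        System.out.println(allowedByInstanceOf(reset));                    // true
        System.out.println(allowedByInstanceOf(wrapped));                  // true
    }
}
```

Under these assumptions, a SocketException fails the exact-class check and would lead to termination, while the subclass-aware check treats it, and any exception wrapping it, as ignorable.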
--
This message was sent by Atlassian Jira
(v8.20.10#820010)