Scott Moore created STORM-4104:
----------------------------------
Summary: Pacemaker server stability issues - e.g. shuts down when
topology killed
Key: STORM-4104
URL: https://issues.apache.org/jira/browse/STORM-4104
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Scott Moore
Assignee: Scott Moore
StormServerHandler, used by the Pacemaker Server (and by the Netty Server in
each Worker), is fragile when handling certain exceptions derived from
IOException. In Storm 1 the same handler ignored ordinary exceptions and
terminated only for serious JVM errors such as OutOfMemoryError.
In Storm 2 the handler does something similar but, instead of ignoring all
'regular' exceptions, it consults a set of ALLOWED_EXCEPTIONS that may be
ignored. This set currently contains only IOException, and the code as it
stands ignores an exception only if it is specifically IOException.
All other exceptions cause the runtime to terminate after logging
"Received error in netty thread.. terminating server..."
When a connection from a worker to the Pacemaker Server terminates, whether
expected (e.g. killing a topology) or unexpected (e.g. a node in the cluster
rebooting), the Pacemaker Server is likely to see a SocketException. This
causes it to terminate.
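To illustrate the failure mode described above, here is a minimal sketch of an allow-list that is consulted by exact class. The class name ExactClassAllowList and the method allowedByExactClass are hypothetical and written only for this illustration; they are not the actual Storm code, and the sketch assumes the allow-list is checked by set membership on the thrown exception's class, as the issue describes.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Set;

class ExactClassAllowList {
    // Hypothetical stand-in for the handler's allow-list: it holds only IOException.
    static final Set<Class<?>> ALLOWED_EXCEPTIONS = Set.of(IOException.class);

    // Exact-class lookup: true only when the thrown object's own class is in the set.
    static boolean allowedByExactClass(Throwable t) {
        return ALLOWED_EXCEPTIONS.contains(t.getClass());
    }

    public static void main(String[] args) {
        Throwable plain = new IOException("stream closed");
        Throwable reset = new SocketException("Connection reset");

        System.out.println(allowedByExactClass(plain));   // true
        System.out.println(allowedByExactClass(reset));   // false
        // The SocketException is still an IOException as far as instanceof is concerned.
        System.out.println(reset instanceof IOException); // true
    }
}
```

Under that assumption, a SocketException falls outside the allow-list even though it is an IOException by inheritance, which matches the termination behaviour reported here.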
As SocketException is derived from IOException, a more robust way for
Pacemaker Server to handle this, and to regain the stability seen with
Storm 1, is to 'swallow' not only IOException itself but also any exception
derived from it (which of course includes SocketException).
Modifying handleUncaughtException to make use of
Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of
Pacemaker. As StormServerHandler is also used in the Worker's Netty Server,
Workers would be more resilient to networking exceptions as well. For example,
a Worker receiving a transfer from a remote that reboots should no longer
restart; we do sometimes see a cascade of worker restarts in such scenarios.
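The proposed check can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the Storm implementation: the class CauseChainCheck and the methods causeIsInstanceOf and shouldIgnore are hypothetical names, and the sketch assumes the intended behaviour is "ignore when the exception, or anything in its cause chain, is an IOException or a subclass of it".

```java
import java.io.IOException;
import java.net.SocketException;

class CauseChainCheck {
    // Hypothetical helper: true if t, or any Throwable in its cause chain,
    // is an instance of the given type (subclasses included).
    static boolean causeIsInstanceOf(Class<? extends Throwable> type, Throwable t) {
        int depth = 0;
        // The depth cap guards against a pathological cyclic cause chain.
        for (Throwable cur = t; cur != null && depth < 64; cur = cur.getCause(), depth++) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical decision: true means "log and carry on", false means "terminate".
    static boolean shouldIgnore(Throwable t) {
        return causeIsInstanceOf(IOException.class, t);
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        Throwable wrapped = new RuntimeException("handler failed", reset);

        System.out.println(shouldIgnore(reset));                        // true
        System.out.println(shouldIgnore(wrapped));                      // true
        System.out.println(shouldIgnore(new IllegalStateException()));  // false
        System.out.println(shouldIgnore(new OutOfMemoryError()));       // false
    }
}
```

With a check of this shape, a connection reset is ignored whether it arrives directly or wrapped in another exception, while serious errors such as OutOfMemoryError still fall through to termination.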
I have modified a build with such a change and can indeed see greater
stability from Pacemaker Server.
I will link a pull request with my changes to this issue soon.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)