Scott Moore created STORM-4104:
----------------------------------

             Summary: Pacemaker server stability issues - e.g. shuts down when 
topology killed
                 Key: STORM-4104
                 URL: https://issues.apache.org/jira/browse/STORM-4104
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-server
    Affects Versions: 2.0.0
            Reporter: Scott Moore
            Assignee: Scott Moore


StormServerHandler used by Pacemaker Server (and by the Netty Server in each 
Worker) is fragile when handling certain Exceptions derived from IOException.

In Storm1 the same handler would ignore Exceptions and only terminate for 
serious JVM exceptions such as OutOfMemory.

The handler in Storm2 does something similar but, instead of ignoring all 
'regular' Exceptions, it has a set of ALLOWED_EXCEPTIONS that can be ignored. 
This set currently contains just IOException.

The code, as it currently stands, ignores only exceptions whose class is 
exactly IOException. All other exceptions, including subclasses of 
IOException, cause the runtime to terminate after logging 
"Received error in netty thread.. terminating server..."

When a connection from a worker to the Pacemaker Server terminates, whether 
expected (e.g. killing a topology) or unexpected (e.g. a node in the cluster 
rebooting), Pacemaker Server is likely to see a SocketException. This causes 
it to terminate.

As SocketException is derived from IOException, I would say a more robust way 
for Pacemaker Server to handle this, and to achieve stability similar to that 
seen with Storm1, is to 'swallow' not only IOException itself but also any 
exception derived from IOException (which will of course include 
SocketException).
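
The difference between the two matching strategies can be illustrated with a 
small standalone sketch. It assumes the allow-list is consulted by exact class, 
as described above. The class and method names (AllowedExceptionCheck, 
isAllowedExact, isAllowedByType) are hypothetical and are not taken from the 
Storm code base.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Collections;
import java.util.Set;

public class AllowedExceptionCheck {
    // Hypothetical stand-in for the handler's allow-list described above.
    static final Set<Class<?>> ALLOWED_EXCEPTIONS =
            Collections.<Class<?>>singleton(IOException.class);

    // Exact-class lookup: matches IOException itself but none of its subclasses.
    static boolean isAllowedExact(Throwable t) {
        return ALLOWED_EXCEPTIONS.contains(t.getClass());
    }

    // Subclass-aware check: matches any exception that is an instance of an
    // allowed class, so SocketException passes because it extends IOException.
    static boolean isAllowedByType(Throwable t) {
        for (Class<?> allowed : ALLOWED_EXCEPTIONS) {
            if (allowed.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        System.out.println("exact:   " + isAllowedExact(reset));   // prints "exact:   false"
        System.out.println("by type: " + isAllowedByType(reset));  // prints "by type: true"
    }
}
```

With the exact-class lookup a SocketException is not recognised as allowed, 
which matches the termination behaviour reported here. The subclass-aware 
check accepts it.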

Modifying handleUncaughtException to make use of 
Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of 
Pacemaker. As StormServerHandler is also used in the Worker's Netty Server, 
the Workers would become more resilient to networking exceptions as well. For 
example, a Worker receiving a transfer from a remote that reboots should no 
longer restart; we do sometimes see a cascade of worker restarts in such 
scenarios.
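
As a rough sketch of the kind of check this implies, the following standalone 
example walks an exception's cause chain looking for an IOException or one of 
its subclasses. It does not call Storm's Utils class. The helper 
causeIsInstanceOf, the method shouldSwallow, and the depth limit of 64 are all 
hypothetical, and the actual change may be structured differently.

```java
import java.io.IOException;
import java.net.SocketException;

public class CauseChainCheck {
    // Hypothetical helper: true if t, or any exception in its cause chain,
    // is an instance of the given type. The depth limit is an arbitrary
    // guard against cyclic cause chains.
    static boolean causeIsInstanceOf(Class<? extends Throwable> type, Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 64; cur = cur.getCause(), depth++) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical policy: never swallow JVM Errors such as OutOfMemoryError;
    // swallow anything that is, or is caused by, an IOException.
    static boolean shouldSwallow(Throwable t) {
        if (t instanceof Error) {
            return false;
        }
        return causeIsInstanceOf(IOException.class, t);
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new SocketException("Connection reset"));
        System.out.println(shouldSwallow(wrapped));                      // prints "true"
        System.out.println(shouldSwallow(new IllegalStateException()));  // prints "false"
        System.out.println(shouldSwallow(new OutOfMemoryError()));       // prints "false"
    }
}
```

Checking the cause chain rather than only the top-level exception means a 
SocketException is still recognised when it arrives wrapped in another 
exception.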

I have modified a build with such a change and can indeed see greater 
stability from Pacemaker Server.

I will have a pull request for the changes I have made linked to this issue 
soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
