[jira] [Updated] (STORM-4104) Pacemaker server stability issues - e.g. shuts down when topology killed

Scott Moore (Jira) Thu, 07 Nov 2024 14:26:29 -0800


     [ 
https://issues.apache.org/jira/browse/STORM-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Scott Moore updated STORM-4104:
-------------------------------
    External issue URL: https://github.com/apache/storm/pull/3739

> Pacemaker server stability issues - e.g. shuts down when topology killed
> ------------------------------------------------------------------------
>
>                 Key: STORM-4104
>                 URL: https://issues.apache.org/jira/browse/STORM-4104
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-server
>    Affects Versions: 2.0.0
>            Reporter: Scott Moore
>            Assignee: Scott Moore
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> StormServerHandler used by Pacemaker Server (and by the Netty Server in each 
> Worker) is fragile when handling certain Exceptions derived from IOException.
> In Storm 1 the same handler ignored ordinary Exceptions and terminated only
> for serious JVM errors such as OutOfMemoryError.
> The Storm 2 handler is similar but, instead of ignoring all 'regular'
> Exceptions, it ignores only those in a set of ALLOWED_EXCEPTIONS, which
> currently contains just IOException.
> As the code currently stands, only IOException itself is ignored, not its
> subclasses. All other exceptions cause the runtime to terminate after
> logging "Received error in netty thread.. terminating server..."
> When a connection from a worker to the Pacemaker Server terminates, whether
> expectedly (e.g. killing a topology) or unexpectedly (e.g. a node in the
> cluster rebooting), Pacemaker Server is likely to see a SocketException,
> which causes it to terminate.
> Since SocketException is derived from IOException, I would say a more robust
> way for Pacemaker Server to handle this, and to regain the stability seen
> with Storm 1, is to 'swallow' not only IOException but also any exception
> derived from it (which of course includes SocketException).
> Modifying handleUncaughtException to make use of
> Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of
> Pacemaker. As StormServerHandler is also used in the Worker's Netty Server,
> the Workers would be more resilient to networking exceptions too. For
> example, a Worker receiving a transfer from a remote that reboots should no
> longer restart; we do sometimes see a cascade of worker restarts in such
> scenarios.
> I have modified a build with such a change and can indeed see greater
> stability from Pacemaker Server.
> I will link a pull request with my changes to this issue soon.
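
The difference between the two checks described above can be sketched in
plain Java. This is an illustration only, not Storm's code: it assumes, as
the description implies, that the current check is an exact-class lookup in
the allowed set. The class AllowedExceptionSketch and both helper methods are
hypothetical names, and the second helper only approximates the kind of
cause-chain test that Utils.exceptionCauseIsInstanceOf is described as
providing.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Set;

public class AllowedExceptionSketch {
    // Stand-in for the ALLOWED_EXCEPTIONS set mentioned in the issue.
    static final Set<Class<?>> ALLOWED_EXCEPTIONS = Set.of(IOException.class);

    // Assumed current behaviour: exact-class lookup, so subclasses of
    // IOException (e.g. SocketException) are NOT treated as allowed.
    static boolean allowedByExactClass(Throwable t) {
        return ALLOWED_EXCEPTIONS.contains(t.getClass());
    }

    // Hypothetical alternative: allowed if the throwable, or anything in
    // its cause chain, is an instance of an allowed class.
    static boolean allowedByInstanceOf(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            for (Class<?> allowed : ALLOWED_EXCEPTIONS) {
                if (allowed.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        Throwable wrapped = new RuntimeException("netty failure", reset);
        Throwable unrelated = new IllegalStateException("bug");

        System.out.println(allowedByExactClass(reset));     // false
        System.out.println(allowedByInstanceOf(reset));     // true
        System.out.println(allowedByInstanceOf(wrapped));   // true
        System.out.println(allowedByInstanceOf(unrelated)); // false
    }
}
```

Under these assumptions, a SocketException raised by a dropped connection
fails the exact-class check but passes the instance-based one, which matches
the termination behaviour reported in the description.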



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
