[ https://issues.apache.org/jira/browse/STORM-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896482#comment-17896482 ]
Scott Moore commented on STORM-4104:
------------------------------------

Pull request created https://github.com/apache/storm/pull/3739

> Pacemaker server stability issues - e.g. shuts down when topology killed
> ------------------------------------------------------------------------
>
>                 Key: STORM-4104
>                 URL: https://issues.apache.org/jira/browse/STORM-4104
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-server
>    Affects Versions: 2.0.0
>            Reporter: Scott Moore
>            Assignee: Scott Moore
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> StormServerHandler, used by Pacemaker Server (and by the Netty Server in
> each Worker), is fragile when handling certain Exceptions derived from
> IOException.
>
> In Storm1 the same handler would ignore Exceptions and only terminate for
> serious JVM exceptions such as OutOfMemory.
>
> In Storm2 the handler does something similar but, instead of ignoring all
> 'regular' Exceptions, it has a set of ALLOWED_EXCEPTIONS which can be
> ignored. This set currently contains just IOException.
>
> The code, as it currently stands, ignores only IOException itself, not
> exceptions derived from it. All other exceptions cause the runtime to
> terminate after logging
> "Received error in netty thread.. terminating server..."
>
> When a connection from a worker to the Pacemaker Server terminates -
> either expected (e.g. killing a topology) or unexpected (e.g. a node in
> the cluster rebooting) - a SocketException is likely to be seen by
> Pacemaker Server. This will cause it to terminate.
>
> As SocketException is derived from IOException, I would say a more robust
> way for Pacemaker Server to handle this, and to achieve stability similar
> to that seen with Storm1, is to 'swallow' not only IOException but any
> exception derived from IOException too (which will of course include
> SocketException).
>
> Modifying handleUncaughtException to make use of
> Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of
> Pacemaker. As StormServerHandler is also used in the Worker's Netty
> Server, the Workers would be more stable in the face of networking
> exceptions as well (e.g. a Worker receiving a transfer from a remote where
> the remote reboots should no longer cause the receiving Worker to restart;
> we do sometimes see a cascade of worker restarts under such scenarios).
>
> I have modified a build with such a change and can indeed see greater
> stability from Pacemaker Server.
>
> I will have a pull request for the changes I have made linked to this
> issue soon.
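
The difference the description draws between ignoring only IOException itself and ignoring anything derived from it can be sketched as follows. This is a minimal illustration, not the code from the pull request. The class AllowedExceptionSketch and both helper methods are hypothetical, isAllowedByType is only a stand-in for the idea behind Utils.exceptionCauseIsInstanceOf, and the exact-class lookup is an assumption about how the current check behaves, based on the description.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Set;

public class AllowedExceptionSketch {
    // Mirrors the set named in the issue: only IOException is listed.
    static final Set<Class<?>> ALLOWED_EXCEPTIONS = Set.of(IOException.class);

    // Assumed current behaviour: an exact class lookup, so subclasses
    // such as SocketException do not match.
    static boolean isAllowedExact(Throwable t) {
        return ALLOWED_EXCEPTIONS.contains(t.getClass());
    }

    // Hypothetical stand-in for the proposed check: walk the cause chain
    // and accept any throwable that is an instance of an allowed class.
    static boolean isAllowedByType(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            for (Class<?> allowed : ALLOWED_EXCEPTIONS) {
                if (allowed.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        Throwable wrapped = new RuntimeException(reset);
        Throwable fatal = new OutOfMemoryError("simulated");

        System.out.println("exact,   SocketException: " + isAllowedExact(reset));
        System.out.println("by type, SocketException: " + isAllowedByType(reset));
        System.out.println("by type, wrapped cause:   " + isAllowedByType(wrapped));
        System.out.println("by type, OutOfMemoryError: " + isAllowedByType(fatal));
    }
}
```

With the exact lookup a SocketException is not treated as allowed, whereas the type-based check accepts it (including when it is wrapped as a cause) while still rejecting serious errors such as OutOfMemoryError.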