[ https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114748#comment-15114748 ]
Rakesh R commented on ZOOKEEPER-2247:
-------------------------------------

Thanks Flavio for pointing out the multiple execution paths.

bq. Could anyone explain to me why we aren't simply relying on the finally blocks?

When an uncaught exception is thrown by any of the internal critical threads, QuorumPeer has no mechanism to learn about that internal error state. It simply continues with #readPacket(). For example, [Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88] will keep reading without ever seeing the error. For the finally blocks to execute, there has to be a way to stop this reading logic. So, as part of the ZOOKEEPER-1907 design discussions, the idea came up to introduce a listening mechanism that takes action and gracefully brings down the QuorumPeer. That created another execution path that changes the state of the server.

bq. If we can do it, I'd much rather have this option implemented rather than multiple code paths that change the state of the server.

I understand your point. How about introducing a polling mechanism in QuorumPeer? Presently, ZooKeeperServerListener takes the decision to shut down the server; instead, ZooKeeperServerListener would only mark the internal error state. Later, while polling, QuorumPeer would see this error and exit the loop gracefully.

The idea is something like this: the ZooKeeper server maintains an {{internalErrorState}}, which QuorumPeer checks while reading packets. If QuorumPeer sees the error, it breaks out of the loop and executes the finally block. On the other side, all the critical threads use ZooKeeperServerListener, which listens for the unexpected errors and notifies the QuorumPeer by calling {{zk.setInternalErrorState(true)}}.
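To make the division of responsibility concrete, here is a minimal, self-contained sketch of the idea (the class, field, and method names below are illustrative only, not taken from any patch): the listener merely records the error, and the polling loop notices the flag and falls through to its finally block for the graceful shutdown.

{code}
import java.util.concurrent.atomic.AtomicBoolean;

public class InternalErrorSketch {
    // Stand-in for the proposed internalErrorState on the ZooKeeper server
    static final AtomicBoolean internalError = new AtomicBoolean(false);

    // Stand-in for ZooKeeperServerListener: only mark the state, do not shut down here
    static void notifyInternalError(String threadName) {
        System.out.println("Critical thread " + threadName + " failed; marking error state");
        internalError.set(true);
    }

    // Stand-in for the Follower's packet loop: poll the flag on every iteration
    static void followLoop() {
        try {
            while (!internalError.get()) {
                // readPacket(qp); processPacket(qp); -- the real work would go here
                notifyInternalError("SyncThread:100"); // simulate a critical-thread failure
            }
        } finally {
            // With the flag set, control reaches the finally block as intended
            System.out.println("finally: shutting down gracefully");
        }
    }

    public static void main(String[] args) {
        followLoop();
    }
}
{code}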
QuorumPeer should have logic like:

{code}
while (self.isRunning() && !zk.hasInternalError()) {
    readPacket(qp);
    processPacket(qp);
}
{code}

A similar polling mechanism has to be introduced in the standalone server [ZooKeeperServerMain.java#L149|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java#L149] as well. I don't think we need to worry about the other internal exceptions that can occur before the ZK server enters the #readPacket() state [Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88]; I expect all of those errors to propagate out and stop the server gracefully. Please correct me if I'm missing any other cases.

> Zookeeper service becomes unavailable when leader fails to write transaction log
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2247
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Arshad Mohammad
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.4.8, 3.5.2
>
>         Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch, ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch, ZOOKEEPER-2247-06.patch
>
>
> Zookeeper service becomes unavailable when leader fails to write transaction log.
> Below are the exceptions:
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:100
> java.io.IOException: Input/output error
> 	at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
> 	at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
> 	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
> 	at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
> 	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
> 	at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
> 	at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
> 	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO [ProcessThread(sid:100 cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception the leader server still remains the leader. After such a non-recoverable exception the leader should go down and let another follower become the leader.

-- 
This message was sent by Atlassian JIRA (v6.3.4#6332)