[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114748#comment-15114748
 ] 

Rakesh R commented on ZOOKEEPER-2247:
-------------------------------------

Thanks Flavio for pointing out the multiple execution paths.

bq. Could anyone explain to me why we aren't simply relying on the finally 
blocks?
When an uncaught exception is thrown by one of the internal critical 
threads, QuorumPeer has no mechanism to learn about that internal error 
state. It simply continues with #readPacket(). For example, 
[Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88]
 keeps reading without ever seeing the error. For the finally blocks to 
execute, there has to be a way to stop this reading logic. So as part of the 
ZOOKEEPER-1907 design discussions, the idea came up to introduce a listening 
mechanism that takes action and gracefully brings down the QuorumPeer. This 
introduced another execution path that changes the state of the server.
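
For reference, the reading loop in question looks roughly like this 
(paraphrased, not an exact copy of the trunk code):
{code}
// Roughly the loop at the referenced line in Follower#followLeader():
// readPacket() blocks on the leader socket, so an error raised in another
// critical thread is never observed here and the finally block is not reached.
QuorumPacket qp = new QuorumPacket();
while (self.isRunning()) {
    readPacket(qp);
    processPacket(qp);
}
{code}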

bq. If we can do it, I'd much rather have this option implemented rather than 
multiple code paths that change the state of the server.
I understand your point. How about introducing a polling mechanism in 
QuorumPeer? Presently ZooKeeperServerListener takes the decision to shut down 
the server; instead, ZooKeeperServerListener would only mark the internal 
error state. Later, while polling, QuorumPeer would see this error and exit 
the loop gracefully.

The idea is that the ZooKeeper server would maintain an 
{{internalErrorState}} flag, which QuorumPeer would then check while reading 
packets. If QuorumPeer sees the error, it breaks out of the loop and the 
finally block executes. On the other side, all the critical threads would use 
ZooKeeperServerListener, which listens for unexpected errors and notifies 
QuorumPeer by calling {{zk.setInternalErrorState(true)}}.
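
A minimal sketch of the server/listener side, using the names proposed above 
(the exact wiring into ZooKeeperServerListener is only an assumption on my 
part), could be:
{code}
// Sketch only -- flag and accessors on ZooKeeperServer as proposed above.
private volatile boolean internalErrorState = false;

public void setInternalErrorState(boolean errorState) {
    this.internalErrorState = errorState;
}

public boolean hasInternalError() {
    return internalErrorState;
}

// Sketch: the listener records the error instead of shutting the server
// down directly. (Assumes the existing notifyStopping(String, int) callback.)
private class ZooKeeperServerListenerImpl implements ZooKeeperServerListener {
    @Override
    public void notifyStopping(String threadName, int errorCode) {
        LOG.info("Thread {} exits, error code {}", threadName, errorCode);
        setInternalErrorState(true);
    }
}
{code}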

QuorumPeer would then have logic like:
{code}
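    // proposed: also exit when ZooKeeperServerListener has flagged an internal error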
    while (self.isRunning() && !zk.hasInternalError()) {
        readPacket(qp);
        processPacket(qp);
    }
{code}

A similar polling mechanism has to be introduced in the standalone server, 
[ZooKeeperServerMain.java#L149|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java#L149],
 as well.
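
For the standalone case, the change could look roughly like the following; 
this is only an illustration of the idea, not the actual ZooKeeperServerMain 
code, and the poll interval is a made-up constant:
{code}
// Illustration only, not the actual ZooKeeperServerMain code: poll the same
// flag at the referenced point so a fatal internal error also ends in a
// graceful shutdown. POLL_INTERVAL_MS is a hypothetical constant.
while (zkServer.isRunning() && !zkServer.hasInternalError()) {
    Thread.sleep(POLL_INTERVAL_MS);
}
shutdown(); // continue with the existing shutdown/cleanup path
{code}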

I don't think we need to worry about the other internal exceptions that can 
occur before the ZK server enters the #readPacket() state at 
[Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88].
 I expect all of those errors to propagate out and stop the server 
gracefully. Please correct me if I'm missing any other cases.

> Zookeeper service becomes unavailable when leader fails to write transaction 
> log
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2247
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Arshad Mohammad
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.4.8, 3.5.2
>
>         Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch, 
> ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch, 
> ZOOKEEPER-2247-06.patch
>
>
> Zookeeper service becomes unavailable when leader fails to write transaction 
> log. Below are the exceptions
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR 
> [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, 
> from thread : SyncThread:100
> java.io.IOException: Input/output error
>       at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>       at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
>       at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
>       at 
> org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
>       at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
>       at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
>       at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
>       at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO  
> [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread 
> SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO  
> [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO  
> [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO  
> [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor 
> complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO  [ProcessThread(sid:100 
> cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception Leader server still remains leader. After this non 
> recoverable exception the leader should go down and let other followers 
> become leader.


