[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skye Wanderman-Milne updated ZOOKEEPER-1599:
--------------------------------------------

    Attachment: ZOOKEEPER-1599.patch

Here's a potential fix, but I don't grok Zab well enough to say whether this 
will create new problems. This patch, which should be applied to 3.4, has the 
leader send an UPTODATE _before_ waiting for an ACK if the follower is running 
the old Zab protocol. This means the follower receives an UPTODATE before the 
leader ZK server starts (i.e., before the leader gets ACKs from a quorum of 
followers), which smells fishy to me, but maybe it's a better problem to have 
than the current one. FWIW, this is how the old Zab protocol worked too.
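
To make the ordering change concrete, here is a rough sketch of the idea, not the patch itself. The class name, the method, and the 0x10000 version cutoff are assumptions made up for illustration; only the SNAP, UPTODATE, and ACK message names come from the description and logs in this issue.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncOrderSketch {
    // Assumed cutoff for illustration: versions below this are treated as old-protocol (3.3) peers.
    static final int ZAB_1_0_VERSION = 0x10000;

    /** Returns the ordered leader-side steps for syncing one follower. */
    static List<String> leaderSteps(int followerProtocolVersion) {
        List<String> steps = new ArrayList<>();
        steps.add("send SNAP");
        if (followerProtocolVersion < ZAB_1_0_VERSION) {
            // Proposed ordering for old-protocol followers:
            // UPTODATE goes out before the leader waits for the ACK.
            steps.add("send UPTODATE");
            steps.add("wait for ACK");
        } else {
            // Ordering for Zab 1.0 followers: wait for the ACK first,
            // and send UPTODATE only afterwards.
            steps.add("wait for ACK");
            steps.add("send UPTODATE");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("old-protocol follower: " + leaderSteps(1));
        System.out.println("Zab 1.0 follower:      " + leaderSteps(ZAB_1_0_VERSION));
    }
}
```

The only difference between the two branches is where "wait for ACK" sits relative to "send UPTODATE", which is the whole of the proposed change as described above.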

Another option would be to patch 3.3 to speak Zab 1.0. That seems less useful, 
though, since you'd have to upgrade to the patched 3.3 release before you 
could safely upgrade to 3.4.
                
> 3.3 server cannot join 3.4 quorum
> ---------------------------------
>
>                 Key: ZOOKEEPER-1599
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1599
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.6, 3.4.5
>            Reporter: Skye Wanderman-Milne
>            Assignee: Skye Wanderman-Milne
>            Priority: Blocker
>             Fix For: 3.3.7, 3.4.6
>
>         Attachments: ZOOKEEPER-1599.patch
>
>
> When a 3.3 server attempts to join an existing quorum led by a 3.4 server, 
> the 3.3 server is disconnected while trying to download the leader's 
> snapshot. The 3.3 server restarts and starts the process over again, but is 
> never able to join the quorum.
> 3.3 server log:
> {code}
> 2012-12-07 10:44:34,582 - INFO  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Learner@294] - Getting a snapshot from 
> leader
> 2012-12-07 10:44:34,582 - INFO  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Learner@325] - Setting leader epoch 12
> 2012-12-07 10:44:54,604 - WARN  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Follower@82] - Exception when following the 
> leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148)
>         at 
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
> 2012-12-07 10:44:54,605 - INFO  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Follower@165] - shutdown called
> java.lang.Exception: shutdown Follower
>         at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649)
> {code}
> 3.4 leader log:
> {code}
> 2012-12-07 10:51:35,178 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@273] - 
> Backward compatibility mode, server id=3
> 2012-12-07 10:51:35,178 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@542] - Notification: 3 (n.leader), 
> 0x1100000000 (n.zxid), 0x2 (n.round), LOOKING (n.state), 3 (n.sid), 0x11 
> (n.peerEPoch), LEADING (my state)
> 2012-12-07 10:51:35,182 [myid:2] - INFO  
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@263] - Follower sid: 3 : info 
> : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@262f4873
> 2012-12-07 10:51:35,182 [myid:2] - INFO  
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@318] - Synchronizing with 
> Follower sid: 3 maxCommittedLog=0x0 minCommittedLog=0x0 
> peerLastZxid=0x1100000000
> 2012-12-07 10:51:35,182 [myid:2] - INFO  
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@395] - Sending SNAP
> 2012-12-07 10:51:35,183 [myid:2] - INFO  
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@419] - Sending snapshot last 
> zxid of peer is 0x1100000000  zxid of leader is 0x1200000000sent zxid of db 
> as 0x1200000000
> 2012-12-07 10:51:55,204 [myid:2] - ERROR 
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@562] - Unexpected exception 
> causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
> 2012-12-07 10:51:55,205 [myid:2] - WARN  
> [LearnerHandler-/127.0.0.1:37654:LearnerHandler@575] - ******* GOODBYE 
> /127.0.0.1:37654 ********
> {code}
