[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972535#action_12972535
 ] 

Camille Fournier commented on ZOOKEEPER-919:
--------------------------------------------

OK, I managed to figure out what is going on here, or at least one thing that 
could be wrong.
Basically, when a follower starts up, it reads the latest snap file and sets 
dataTree.lastProcessedZxid to the zxid specified by that file. This is not 
the right thing to do. When you call snapLog.restore (before you connect to 
the leader), you both deserialize the latest snapshot AND play through the 
transaction logs. While replaying the transaction logs, you find the actual 
last transaction you have recorded, which can be less than the snapshot's 
zxid (I will explain why below). That zxid is returned by the call to 
restore, but we never actually use it; the follower always keeps the zxid 
from the chosen snap file (see FileSnap.deserialize). 
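To make the restore problem concrete, here is a minimal sketch of that flow. It is not the real snapLog/FileSnap code; the method names and the idea of modeling txns as bare zxids are illustrative only. It shows how the real last zxid from log replay is computed and returned, while lastProcessedZxid silently keeps the snapshot's tag:

```java
import java.util.List;

class RestoreSketch {
    static long lastProcessedZxid;

    // FileSnap.deserialize-style step: tag the data tree with the snapshot
    // file's zxid, unconditionally.
    static long deserializeSnap(long snapZxid) {
        lastProcessedZxid = snapZxid;
        return snapZxid;
    }

    // Replay step: find the highest zxid actually recorded in the txn log.
    static long replayTxnLog(List<Long> txnLog) {
        long highest = -1;
        for (long zxid : txnLog) {
            highest = Math.max(highest, zxid);  // apply txn, track real last zxid
        }
        return highest;
    }

    // snapLog.restore-style call: the real last zxid is computed and
    // returned, but the caller keeps the snapshot's zxid instead.
    static long restore(long snapZxid, List<Long> txnLog) {
        deserializeSnap(snapZxid);
        return replayTxnLog(txnLog);
    }

    public static void main(String[] args) {
        // snapshot.100 exists, but the log only actually reaches zxid 95.
        long real = restore(100, List.of(93L, 94L, 95L));
        System.out.println("restore returned " + real
                + ", but lastProcessedZxid is " + lastProcessedZxid);
    }
}
```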

So, follower crashes at zxid N-5, starts back up, replays state to N-5, and 
connects to the leader. 

When it connects to the leader, it gets the leader's zxid of N. In 
Learner.syncWithLeader, it sees that it will get a diff, and sets its 
lastProcessedZxid to N, then acks the leader and starts up. The leader queues 
up the transactions N-4...N and sends those to the follower, followed finally 
by the UPTODATE message.
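The questionable assignment in the DIFF branch can be isolated in a small sketch. Learner.syncWithLeader's real signature and the packet-type constant differ; both are illustrative here:

```java
class DiffSyncSketch {
    static final int DIFF = 13;          // illustrative packet-type constant
    static long lastProcessedZxid = 95;  // follower restored to N-5

    static long handleSyncPacket(int packetType, long leaderZxid) {
        if (packetType == DIFF) {
            // The follower adopts the leader's zxid N *before* it has
            // received, let alone applied, transactions N-4..N.
            lastProcessedZxid = leaderZxid;
        }
        return lastProcessedZxid;
    }

    public static void main(String[] args) {
        System.out.println(handleSyncPacket(DIFF, 100));
    }
}
```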

So what does the Follower now see? Well, it sees the transactions in order, 
PROPOSAL then COMMIT, for all of N-4 to N, followed by the UPTODATE. However, 
the Follower processes the PROPOSAL and COMMIT packets *in separate threads*, 
so it pulls those messages off the wire quickly, queues them to be processed 
by the Sync/Commit processor threads, and keeps going until it hits the 
UPTODATE message. At that point it writes a snapshot tagged with zxid N. We 
have NO GUARANTEE that we have actually processed transactions N-4 to N, and 
as Vishal has seen, sometimes the thread that writes the snapshot runs before 
any or all of the commit processor threads have.
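The race above can be sketched with two threads. This is not ZooKeeper's actual Learner/CommitProcessor code; it is a toy model where txns are bare zxids, and a latch deliberately forces the bad interleaving (commit thread runs only after the "snapshot" is taken) so the ordering is deterministic:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SyncRaceSketch {
    // Returns [0] = zxid the snapshot is tagged with, [1] = last zxid
    // actually applied to the data tree when the snapshot was written.
    static long[] run() throws InterruptedException {
        LinkedBlockingQueue<Long> committed = new LinkedBlockingQueue<>();
        AtomicLong applied = new AtomicLong(0);
        CountDownLatch snapshotWritten = new CountDownLatch(1);

        // Stand-in for the Sync/Commit processor thread: it applies queued
        // txns, but only after the snapshot has already been written.
        Thread commitThread = new Thread(() -> {
            try {
                snapshotWritten.await();
                Long zxid;
                while ((zxid = committed.poll(200, TimeUnit.MILLISECONDS)) != null) {
                    applied.set(zxid);  // apply txn to the data tree
                }
            } catch (InterruptedException ignored) { }
        });
        commitThread.start();

        long leaderZxid = 100;  // N
        for (long zxid = leaderZxid - 4; zxid <= leaderZxid; zxid++) {
            committed.add(zxid);  // PROPOSAL+COMMIT pulled off the wire, queued
        }

        // Learner thread hits UPTODATE: snapshot.N is written even though
        // none of N-4..N have been applied yet.
        long[] result = { leaderZxid, applied.get() };
        snapshotWritten.countDown();
        commitThread.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] r = run();
        System.out.println("snapshot tagged zxid " + r[0]
                + ", data tree applied through zxid " + r[1]);
    }
}
```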

But generally this is OK, because you are also writing transactions N-4 to N 
to your transaction log; as long as you have processed all of them before 
your Follower goes down again, they will be applied on top of the snapshot 
when you recover and you will be fine. HOWEVER, if you kill the Follower 
after it has written snapshot.N but before it has processed transactions N-4 
to N and written them to its log, then when you restore the Follower it will 
believe it is at zxid N, it will never see those transactions, and it will 
never delete those nodes.

I believe one solution might be, when restoring the database on startup, to 
set lastProcessedZxid to the zxid of the last transaction in your actual tx 
log, NOT the zxid of the snapshot file. However, that might break restoring 
from copied snapshots, so I think we should consider other options. But 
setting the follower's zxid to the server's zxid before it has processed all 
the missing transactions is definitely the wrong thing to do.
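One way to picture the suggested fix, under the assumption that restore can see both the snapshot's zxid and the highest zxid actually present in the txn log (all names here are hypothetical, and the copied-snapshot case is modeled as an empty log):

```java
class ProposedFixSketch {
    // Trust the txn log, not the snapshot tag: if the log stops short of the
    // snapshot's zxid, the missing txns were never durably recorded and
    // should be re-fetched from the leader.
    static long lastProcessedZxid(long snapZxid, long lastTxnLogZxid) {
        // lastTxnLogZxid < 0 models a restore from a copied snapshot with no
        // txn log at all, the case the comment above warns might break; fall
        // back to the snapshot's zxid there.
        return lastTxnLogZxid >= 0 ? lastTxnLogZxid : snapZxid;
    }

    public static void main(String[] args) {
        // snapshot.100 exists but the log only reaches 95: report 95 so the
        // leader resends 96..100 instead of those txns being silently lost.
        System.out.println(lastProcessedZxid(100, 95));
        // Copied snapshot, no log: keep the snapshot's zxid.
        System.out.println(lastProcessedZxid(100, -1));
    }
}
```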

> Ephemeral nodes remains in one of ensemble after deliberate SIGKILL
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-919
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-919
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.1
>         Environment: Linux CentOS 5.3 64bit, JDK 1.6.0-22
> SLES 11
>            Reporter: Chang Song
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: logs.tar.gz, logs2.tar.gz, logs3.tar.gz, zk.patch
>
>
> I was testing the stability of a ZooKeeper ensemble for production 
> deployment, with a three-node ensemble cluster configuration.
> In a loop, I kill/restart three ZooKeeper clients that each created one 
> ephemeral node, and at the same time
> I killed the Java process on one of the ensemble servers (I don't know 
> whether it was the leader or not). Then I restarted ZooKeeper on that 
> server.
> It turns out that on two of the ensemble servers all the ephemeral nodes 
> are gone (as they should be), but on the newly restarted
> server, two old ephemeral nodes stayed. The server didn't restart in 
> standalone mode, since new ephemeral nodes get created on all ensemble 
> servers. 
> I captured the log.
> 2010-11-04 17:48:50,201 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioservercnxn$fact...@250] - 
> Accepted socket connection from /10.25.131.21:11191
> 2010-11-04 17:48:50,202 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@776] - Client 
> attempting to establish new session at /10.25.131.21:11191
> 2010-11-04 17:48:50,203 - INFO  [CommitProcessor:1:nioserverc...@1579] - 
> Established session 0x12c160c31fc000b with negotiated timeout 30000 for 
> client /10.25.131.21:11191
> 2010-11-04 17:48:50,206 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@633] - 
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x12c160c31fc000b, likely client has closed socket
> 2010-11-04 17:48:50,207 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@1434] - Closed 
> socket connection for client /10.25.131.21:11191 which had sessionid 
> 0x12c160c31fc000b
> 2010-11-04 17:48:50,207 - ERROR [CommitProcessor:1:nioserverc...@444] - 
> Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:417)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1508)
>         at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
>         at 
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.