[
https://issues.apache.org/jira/browse/ZOOKEEPER-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972535#action_12972535
]
Camille Fournier commented on ZOOKEEPER-919:
--------------------------------------------
Ok, I managed to figure out what is going on here, or at least one thing that
could be wrong.
Basically, when a follower starts up, it reads the latest snap file and sets
dataTree.lastProcessedZxid to the zxid specified by that file. This is not the
right thing to do. When you call snapLog.restore (before you connect to the
leader), you both deserialize the latest snapshot AND play through the
transaction logs. While playing through the transaction logs, you find the
actual last transaction you have recorded, which can be less than the
snapshot's zxid (I will explain why below). That zxid is returned by the call
to restore, but we don't actually use it; you always end up with the zxid from
the chosen snap file (see FileSnap.deserialize).
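To make that concrete, here is a rough sketch of the restore path as I read it
(the helper names are illustrative stand-ins, not the actual methods):
{code:java}
// Rough sketch only -- helper names like findMostRecentSnapshot, applyTxn and
// readTxnLogs are illustrative stand-ins, not the real 3.3 API.
long restore(DataTree dt, Map<Long, Integer> sessions) throws IOException {
    // 1. Deserialize the newest valid snapshot. lastProcessedZxid is taken
    //    from the zxid encoded in the snapshot file's name (this is what
    //    FileSnap.deserialize ends up doing).
    File snap = findMostRecentSnapshot();
    deserializeSnapshot(dt, sessions, snap);
    dt.lastProcessedZxid = zxidFromSnapshotName(snap.getName());

    // 2. Play through the on-disk transaction logs, remembering the zxid of
    //    the last transaction that is actually recorded there.
    long lastTxnZxid = dt.lastProcessedZxid;
    for (Txn txn : readTxnLogs()) {
        applyTxn(dt, txn);
        lastTxnZxid = txn.getZxid();
    }

    // 3. That zxid is returned to the caller, but the caller never uses it to
    //    correct dt.lastProcessedZxid, which keeps the snapshot's value.
    return lastTxnZxid;
}
{code}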
So, follower crashes at zxid N-5, starts back up, replays state to N-5, and
connects to the leader.
When it connects to the leader, it gets the leader's zxid of N. In
Learner.syncWithLeader, it sees that it will get a DIFF, sets its
lastProcessedZxid to N, then acks the leader and starts up. The leader queues
up the transactions N-4...N and sends those to the follower, followed finally
by the UPTODATE message.
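The relevant piece of Learner.syncWithLeader, roughly paraphrased from memory
(exact names and structure may differ from the real code), is:
{code:java}
// Approximate paraphrase of the DIFF handling in Learner.syncWithLeader --
// not the literal 3.3 source.
readPacket(qp);                        // first sync packet from the leader
if (qp.getType() == Leader.DIFF) {
    // The follower's notion of where it is jumps straight to the leader's
    // zxid N, before any of the missing transactions have been applied.
    zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
}
{code}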
So what does the Follower now see? Well, it sees the transactions in order,
PROPOSAL then COMMIT, for all of N-4 to N, followed by the UPTODATE. However,
the Follower processes the PROPOSAL and COMMIT packets *in separate threads*,
so it pulls those messages off the wire quickly, queues them to be processed by
the Sync/Commit processor threads, and keeps going until it hits the UPTODATE
message. At that point, it writes a snapshot with zxid N. We have NO GUARANTEE
that we have actually processed transactions N-4 to N, and as Vishal has seen,
sometimes the thread that writes the snapshot gets executed before any or all
of the commit processor threads have actually run.
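Schematically (illustrative pseudocode only, not the real classes), the two
sides of the race look like this:
{code:java}
// Illustrative schematic of the race -- not ZooKeeper's actual classes.

// Thread 1: the Learner's packet-reading loop
while (true) {
    QuorumPacket p = readPacket();
    switch (p.getType()) {
    case Leader.PROPOSAL:
    case Leader.COMMIT:
        pendingTxns.add(p);        // handed off to the Sync/Commit processor
        break;                     // threads; NOT yet applied to the DataTree
    case Leader.UPTODATE:
        zk.takeSnapshot();         // snapshot.N is written right here, while
        return;                    // N-4..N may still be sitting in the queue
    }
}

// Thread 2: the Sync/Commit processor pipeline drains pendingTxns, logging
// each transaction and applying it to the DataTree -- possibly only after
// the snapshot above has already been written.
{code}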
But generally this is ok, because you are also writing transactions N-4 to N
to your transaction log, so as long as you have processed all of them and
written them to your transaction log before your Follower goes down again,
when you recover they will be applied on top of the snapshot and you will be
fine. HOWEVER, if you kill the Follower after it has written snapshot.N but
before it has processed transactions N-4 to N and written them to its log,
then when you restore the Follower it will believe that it is at zxid N, it
will never see those transactions, and it will never delete those nodes.
I believe one solution might be, when you are restoring the database on
startup, to set lastProcessedZxid to the zxid of the last transaction in your
actual tx log, NOT the zxid of the snapshot file. However, that might break
restoring from copied snapshots, so I think we should consider other options.
But setting the follower's zxid to the leader's zxid before it has processed
all the missing transactions is definitely the wrong thing to do.
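Just to illustrate the direction I mean (hypothetical and untested, not a
patch):
{code:java}
// Hypothetical sketch, for discussion only: trust the last zxid actually
// found in the transaction log, not the zxid in the snapshot file's name.
long lastTxnZxid = snapLog.restore(dataTree, sessions, listener);
if (lastTxnZxid > 0) {
    dataTree.lastProcessedZxid = lastTxnZxid;
}
// Caveat from above: a data dir restored from a copied snapshot with no
// transaction logs could then come up with the wrong zxid, so this needs
// more thought.
{code}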
> Ephemeral nodes remains in one of ensemble after deliberate SIGKILL
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-919
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-919
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.1
> Environment: Linux CentOS 5.3 64bit, JDK 1.6.0-22
> SLES 11
> Reporter: Chang Song
> Priority: Blocker
> Fix For: 3.3.3, 3.4.0
>
> Attachments: logs.tar.gz, logs2.tar.gz, logs3.tar.gz, zk.patch
>
>
> I was testing the stability of a Zookeeper ensemble for production
> deployment, using a three-node ensemble cluster configuration.
> In a loop, I kill/restart three Zookeeper clients that each created one
> ephemeral node, and at the same time,
> I killed the Java process on one of the ensemble servers (don't know if it
> was the leader or not). Then I restarted Zookeeper on that server.
> It turns out that on two of the ensemble servers, all the ephemeral nodes
> are gone (as they should be), but on the newly started
> Zookeeper server, the two old ephemeral nodes stayed. The Zookeeper server
> didn't restart in standalone mode, since new ephemeral
> nodes do get created on all ensemble servers.
> I captured the log.
> 2010-11-04 17:48:50,201 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn$Factory@250] -
> Accepted socket connection from /10.25.131.21:11191
> 2010-11-04 17:48:50,202 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@776] - Client
> attempting to establish new session at /10.25.131.21:11191
> 2010-11-04 17:48:50,203 - INFO [CommitProcessor:1:NIOServerCnxn@1579] -
> Established session 0x12c160c31fc000b with negotiated timeout 30000 for
> client /10.25.131.21:11191
> 2010-11-04 17:48:50,206 - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@633] -
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x12c160c31fc000b, likely client has closed socket
> 2010-11-04 17:48:50,207 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@1434] - Closed
> socket connection for client /10.25.131.21:11191 which had sessionid
> 0x12c160c31fc000b
> 2010-11-04 17:48:50,207 - ERROR [CommitProcessor:1:NIOServerCnxn@444] -
> Unexpected Exception:
> java.nio.channels.CancelledKeyException
> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:417)
> at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1508)
> at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
> at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)