[jira] Commented: (ZOOKEEPER-919) Ephemeral nodes remains in one of ensemble after deliberate SIGKILL

Vishal K (JIRA) Mon, 03 Jan 2011 01:51:15 -0800

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976668#action_12976668 ]


Vishal K commented on ZOOKEEPER-919:
------------------------------------

Hi Camille,

I forgot to mention this earlier. In your description above you
mentioned "But generally, this is ok, because you are also writing a
transaction log of transactions N-4 to N, so you still process them,
write them to your transaction log, and as long as you have processed
all of them before your Follower goes down again when you recover they
will be applied to the snapshot and you will be fine. HOWEVER, if you
kill the Follower after it has written snapshot.N and before it has
processed transactions N-4 to N and written them to its log, when you
restore the Follower it will believe that it is at Zxid N, it won't
ever see those transactions, and it will never delete those nodes."

I don't think this is quite right. We do not need to kill the
follower after it creates the snapshot.N file and before it writes N
to the log in order to hit the problem. The bug exists even without
that failure (my test was not failing the Follower during this
window): FileTxnSnapLog.restore() does not replay transactions that
are < zxid of the snapshot. Let me know if you think otherwise.

My change takes a safe approach and ensures that the zxid of the
snapshot file at the follower is equal to the zxid of the last
transaction processed by the follower.
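
To make the replay argument concrete, here is a minimal toy model. It is
not ZooKeeper code: RestoreSketch, DeleteTxn and this restore() are
hypothetical names, and the skip rule (ignore any logged transaction whose
zxid is not newer than the snapshot's zxid) is an assumption chosen to match
the behaviour described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these types are real ZooKeeper classes.
final class RestoreSketch {
    // A logged transaction that deletes one ephemeral node.
    static final class DeleteTxn {
        final long zxid;
        final String path;

        DeleteTxn(long zxid, String path) {
            this.zxid = zxid;
            this.path = path;
        }
    }

    // Rebuilds the tree from a snapshot plus the transaction log.
    // Assumed skip rule: a transaction whose zxid is not newer than the
    // snapshot's zxid is treated as already contained in the snapshot.
    static Set<String> restore(Set<String> snapshotNodes, long snapshotZxid,
                               List<DeleteTxn> log) {
        Set<String> tree = new TreeSet<>(snapshotNodes);
        for (DeleteTxn txn : log) {
            if (txn.zxid <= snapshotZxid) {
                continue; // skipped, even if the snapshot never saw it
            }
            tree.remove(txn.path);
        }
        return tree;
    }

    public static void main(String[] args) {
        // Snapshot data reflects state up to zxid 8 and still holds both nodes.
        Set<String> nodes = new TreeSet<>(Arrays.asList("/e1", "/e2"));
        List<DeleteTxn> log = Arrays.asList(
                new DeleteTxn(9L, "/e1"),
                new DeleteTxn(10L, "/e2"));

        // Snapshot file named with zxid 10: both deletes are skipped.
        System.out.println(restore(nodes, 10L, log)); // prints [/e1, /e2]

        // Snapshot named with the last zxid actually applied (8): both replayed.
        System.out.println(restore(nodes, 8L, log));  // prints []
    }
}
```

In this model the stale nodes survive whenever the snapshot's zxid is ahead
of the data it contains, with no crash in between. Keeping the two equal
makes the problem disappear.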

Ideally, I think we need to change Follower.processPacket() to make
sure that, while handling UPTODATE, all the necessary transactions are
committed locally before the snapshot is taken.
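
A rough sketch of that ordering, again as a toy model rather than the real
Follower: UpToDateSketch, PendingDelete, Snapshot, enqueue() and
onUpToDate() are all hypothetical names. The only point is that pending
transactions are applied before the snapshot is taken, so the snapshot's
zxid matches the last transaction processed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model; this is not the real Follower class.
final class UpToDateSketch {
    static final class PendingDelete {
        final long zxid;
        final String path;

        PendingDelete(long zxid, String path) {
            this.zxid = zxid;
            this.path = path;
        }
    }

    static final class Snapshot {
        final long zxid;
        final Set<String> nodes;

        Snapshot(long zxid, Set<String> nodes) {
            this.zxid = zxid;
            this.nodes = nodes;
        }
    }

    private final Set<String> tree;
    private final Deque<PendingDelete> pending = new ArrayDeque<>();
    private long lastProcessedZxid;

    UpToDateSketch(long lastProcessedZxid, Set<String> nodes) {
        this.lastProcessedZxid = lastProcessedZxid;
        this.tree = new TreeSet<>(nodes);
    }

    // A transaction received from the leader but not yet applied locally.
    void enqueue(long zxid, String path) {
        pending.add(new PendingDelete(zxid, path));
    }

    // Hypothetical UPTODATE handler: apply everything pending first,
    // then snapshot using the zxid of the last applied transaction.
    Snapshot onUpToDate() {
        while (!pending.isEmpty()) {
            PendingDelete txn = pending.poll();
            tree.remove(txn.path);
            lastProcessedZxid = txn.zxid;
        }
        return new Snapshot(lastProcessedZxid, new TreeSet<>(tree));
    }
}
```

With nodes /e1 and /e2 at zxid 8 and pending deletes at zxids 9 and 10,
onUpToDate() in this model returns a snapshot at zxid 10 that contains no
ephemeral nodes.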

Thanks.

> Ephemeral nodes remains in one of ensemble after deliberate SIGKILL
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-919
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-919
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.1
>         Environment: Linux CentOS 5.3 64bit, JDK 1.6.0-22
> SLES 11
>            Reporter: Chang Song
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: logs.tar.gz, logs2.tar.gz, logs3.tar.gz, zk.patch
>
>
> I was testing the stability of a ZooKeeper ensemble for production
> deployment, using a three-node ensemble configuration.
> In a loop, I killed and restarted three ZooKeeper clients that created one
> ephemeral node each, and at the same time
> I killed the Java process on one of the ensemble servers (I don't know
> whether it was the leader or not). Then I restarted ZooKeeper on that server.
> It turns out that on two of the ensemble servers all the ephemeral nodes
> are gone (as they should be), but on the newly restarted
> ZooKeeper server the two old ephemeral nodes remained. That server did not
> restart in standalone mode, since new ephemeral
> nodes get created on all ensemble servers.
> I captured the log.
> 2010-11-04 17:48:50,201 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioservercnxn$fact...@250] - 
> Accepted socket connection from /10.25.131.21:11191
> 2010-11-04 17:48:50,202 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@776] - Client 
> attempting to establish new session at /10.25.131.21:11191
> 2010-11-04 17:48:50,203 - INFO  [CommitProcessor:1:nioserverc...@1579] - 
> Established session 0x12c160c31fc000b with negotiated timeout 30000 for 
> client /10.25.131.21:11191
> 2010-11-04 17:48:50,206 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@633] - 
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x12c160c31fc000b, likely client has closed socket
> 2010-11-04 17:48:50,207 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:nioserverc...@1434] - Closed 
> socket connection for client /10.25.131.21:11191 which had sessionid 
> 0x12c160c31fc000b
> 2010-11-04 17:48:50,207 - ERROR [CommitProcessor:1:nioserverc...@444] - 
> Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:417)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1508)
>         at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
>         at 
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
