[
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019392#comment-14019392
]
Marshall McMullen commented on ZOOKEEPER-1934:
----------------------------------------------
Yet before we grabbed this data, the offending node (nodeid=2, myid=2) stated
this:
1723 Jun 4 08:13:30 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:ZooKeeperServer@156] - Created server with
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir
/sf/data/zoo
1724 Jun 4 08:13:30 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:Follower@66] - FOLLOWING - LEADER ELECTION
TOOK - -1401867542249
1725 Jun 4 08:13:30 zookeeper - WARN
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@240] - Unexpected exception,
tries=0, connecting to /10.26.65.103:2182
1726 Jun 4 08:13:30 localhost at
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:232)
1727 Jun 4 08:13:30 localhost at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:74)
1728 Jun 4 08:13:30 localhost at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:967)
1729 Jun 4 08:13:30 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.103:35987
1730 Jun 4 08:13:30 zookeeper - WARN [NIOWorkerThread-37:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1731 Jun 4 08:13:30 zookeeper - INFO [NIOWorkerThread-37:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.103:35987 (no session established
for client)
1732 Jun 4 08:13:31 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.103:59764
1733 Jun 4 08:13:31 zookeeper - WARN [NIOWorkerThread-38:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1734 Jun 4 08:13:31 zookeeper - INFO [NIOWorkerThread-38:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.103:59764 (no session established
for client)
1735 Jun 4 08:13:31 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.103:51005
1736 Jun 4 08:13:31 zookeeper - WARN [NIOWorkerThread-39:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1737 Jun 4 08:13:31 zookeeper - INFO [NIOWorkerThread-39:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.103:51005 (no session established
for client)
1738 Jun 4 08:13:31 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.3:39628
1739 Jun 4 08:13:31 zookeeper - WARN [NIOWorkerThread-40:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1740 Jun 4 08:13:31 zookeeper - INFO [NIOWorkerThread-40:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.3:39628 (no session established
for client)
1741 Jun 4 08:13:31 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.3:47705
1742 Jun 4 08:13:31 zookeeper - WARN [NIOWorkerThread-41:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1743 Jun 4 08:13:31 zookeeper - INFO [NIOWorkerThread-41:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.3:47705 (no session established
for client)
1744 Jun 4 08:13:31 zookeeper - INFO
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.26.65.3:34353
1745 Jun 4 08:13:31 zookeeper - WARN [NIOWorkerThread-42:NIOServerCnxn@365] -
Exception causing close of session 0x0: ZooKeeperServer not running
1746 Jun 4 08:13:31 zookeeper - INFO [NIOWorkerThread-42:NIOServerCnxn@999] -
Closed socket connection for client /10.26.65.3:34353 (no session established
for client)
1747 Jun 4 08:13:31 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@332] - Getting a diff from the
leader 0x300001c4c
1748 Jun 4 08:13:31 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@475] - Learner received NEWLEADER
message
1749 Jun 4 08:13:31 zookeeper - WARN
[QuorumPeer[myid=2]/10.26.65.47:2181:QuorumPeer@1271] -
setLastSeenQuorumVerifier called with stale config 4294967296. Current version:
4294967296
1750 Jun 4 08:13:31 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:FileTxnSnapLog@297] - Snapshotting:
0x300001c4c to /sf/data/zookeeper/10.26.65.47/version-2/snapshot.300001c4c
1751 Jun 4 08:13:31 zookeeper - INFO
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@460] - Learner received UPTODATE
message
> Stale data received from sync'd ensemble peer
> ---------------------------------------------
>
> Key: ZOOKEEPER-1934
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Marshall McMullen
> Attachments: node1.log, node2.log, node3.log, node4.log, node5.log
>
>
> In our regression testing we encountered an error wherein we were caching a
> value we read from zookeeper and then experienced session loss. We
> subsequently got reconnected to a different zookeeper server. When we tried
> to read the same path from this new zookeeper server we are getting a stale
> value.
> Specifically, we are reading "/binchanges" and originally got back a version
> of "4" from the first server. After we lost connection and reconnected before
> the session timeout, we then read "/binchanges" from the new server and got
> back a value of "3".
> It's my understanding of the single system image guarantee that this should
> never be allowed. I realize that the single system image guarantee is still
> quorum based and it's certainly possible that a minority of the ensemble may
> have stale data. However, I also believe that each client has to send the
> highest zxid it's seen as part of its connection request to the server. And
> if the server it's connecting to has a smaller zxid than the value the client
> sends, then the connection request should be refused.
> Assuming I have all of that correct, then I'm at a loss for how this
> happened.
> The failure happened around Jun 4 08:13:44. Just before that, at June 4
> 08:13:30 there was a round of leader election. During that round of leader
> election we voted server with id=4 and zxid=0x300001c4c. This then led to a
> new zxid=0x400000001. The new leader sends a diff to all the servers
> including the one we will soon read the stale data from (id=2). Server with
> ID=2's log files also reflect that as of 08:13:43 it was up to date and
> current with an UPTODATE message.
> I'm going to attach log files from all 5 ensemble nodes. I also used
> zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those,
> and compared them all for correctness. 1 of the nodes (id=2) has a massively
> divergent zktreeutil dump than the other 4 nodes even though it received the
> diff from the new leader.
> In the attachments there are 5 nodes. I will number each log file by it's
> zookeeper id, e.g. node4_zookeeper.log.
--
This message was sent by Atlassian JIRA
(v6.2#6252)