[
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135644#comment-14135644
]
Jared Cantwell commented on ZOOKEEPER-1934:
-------------------------------------------
I am working with Marshall on this issue. Upon investigating further, I
noticed that every one of our processes that hit the problem had just
re-established a connection to node2 immediately before reading stale state
through it. This matches Marshall's observation that dumping state with
zktreeutil on node2 shows an out-of-date database, and that the database
remained stale for some time afterward.
So right now I am investigating how node2 could believe its local database
was at zxid N+1 (which is why it accepted the connection from the client)
while the actual contents of the database reflect zxid <= N. Leader election
completed about 15 seconds prior to this, and connections were only just
receiving session-connected events. I'm still digging through the leader
election logs, but nothing stands out as suspicious to me yet.
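Independent of zktreeutil, a server can also be asked directly what zxid it
believes it has processed: the "srvr" four-letter command prints a
"Zxid: 0x..." line. A minimal sketch of that cross-check (host and port are
placeholders for the suspect node):
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

// Ask a ZooKeeper server for its own view of its state via the "srvr"
// four-letter command. The output includes a "Zxid: 0x..." line that can
// be compared against what zktreeutil dumps from the same node.
public class SrvrCheck {
    public static void main(String[] args) throws IOException {
        String host = "node2"; // placeholder for the suspect server
        int port = 2181;       // default client port
        try (Socket sock = new Socket(host, port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(sock.getInputStream()))) {
            sock.getOutputStream().write("srvr".getBytes());
            sock.getOutputStream().flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
{code}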
Any help or ideas for things to look for would be much appreciated.
> Stale data received from sync'd ensemble peer
> ---------------------------------------------
>
> Key: ZOOKEEPER-1934
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.0
> Reporter: Marshall McMullen
> Attachments: node1.log, node2.log, node3.log, node4.log, node5.log
>
>
> In our regression testing we encountered an error wherein we cached a
> value we read from ZooKeeper and then experienced session loss. We
> subsequently reconnected to a different ZooKeeper server. When we tried to
> read the same path from this new server, we got a stale value.
> Specifically, we read "/binchanges" and originally got back a value of
> "3" from the first server. After we lost the connection and reconnected
> before the session timeout, we read "/binchanges" from the new server and
> got back a value of "2". Nothing in our code path ever sets this value
> from 3 back to 2. We throw an assertion if the value ever goes backwards,
> which is how we caught this error.
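> To make the failure concrete, here is roughly the shape of the check that
> fired, sketched against the standard ZooKeeper Java client (the connection
> string and names here are illustrative, and the real client reconnects
> within the existing session rather than opening a new handle):
> {code:java}
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class BinChangesCheck {
>     public static void main(String[] args) throws Exception {
>         long lastSeen = 3; // value cached from the first server
>
>         // Illustrative connection to the server we failed over to.
>         ZooKeeper zk = new ZooKeeper("node2:2181", 30000, new Watcher() {
>             public void process(WatchedEvent event) { /* ignore */ }
>         });
>         Stat stat = new Stat();
>         byte[] data = zk.getData("/binchanges", false, stat);
>         long current = Long.parseLong(new String(data));
>
>         // The value must never go backwards; this assertion caught the bug.
>         if (current < lastSeen) {
>             throw new AssertionError("/binchanges went backwards: "
>                     + lastSeen + " -> " + current + " (mzxid 0x"
>                     + Long.toHexString(stat.getMzxid()) + ")");
>         }
>         zk.close();
>     }
> }
> {code}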
> It's my understanding that the single system image guarantee means this
> should never be allowed. I realize that the guarantee is still quorum
> based, and it's certainly possible for a minority of the ensemble to have
> stale data. However, I also believe that each client sends the highest
> zxid it has seen as part of its connection request to the server, and if
> the server it's connecting to has a smaller zxid than the value the client
> sends, the connection request should be refused.
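> For reference, that guard lives in the server's connection handling
> (ZooKeeperServer.processConnectRequest()); a simplified sketch of the
> logic, not the exact source:
> {code:java}
> // Simplified sketch: a client whose last seen zxid is ahead of this
> // server's database is refused and must try another server.
> long clientLastZxid = connReq.getLastZxidSeen();
> long serverLastZxid = getZKDatabase().getDataTreeLastProcessedZxid();
> if (clientLastZxid > serverLastZxid) {
>     throw new CloseRequestException("Refusing session request for client "
>             + cnxn.getRemoteSocketAddress()
>             + " as it has seen zxid 0x" + Long.toHexString(clientLastZxid)
>             + "; our last zxid is 0x" + Long.toHexString(serverLastZxid)
>             + "; client must try another server");
> }
> {code}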
> Assuming I have all of that correct, I'm at a loss for how this happened.
> The failure happened around Jun 4 08:13:44. Just before that, at Jun 4
> 08:13:30, there was a round of leader election. During that round we
> elected the server with id=4 at zxid=0x300001c4c, which led to a new
> zxid=0x400000001. The new leader sent a DIFF to all the servers, including
> the one we would soon read the stale data from (id=2). Server id=2's log
> also reflects that as of 08:13:43 it was up to date, having received an
> UPTODATE message.
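> (To decode those numbers: a zxid carries the leader epoch in its high 32
> bits and a per-epoch counter in the low 32 bits, so the election bumped
> the epoch from 3 to 4. A small sketch mirroring what ZooKeeper's ZxidUtils
> does:)
> {code:java}
> public class ZxidDecode {
>     public static void main(String[] args) {
>         long zxidBefore = 0x300001c4cL; // last zxid under the old leader
>         long zxidAfter  = 0x400000001L; // first zxid under the new leader
>         System.out.printf("before: epoch=0x%x counter=0x%x%n",
>                 zxidBefore >> 32, zxidBefore & 0xffffffffL); // 0x3, 0x1c4c
>         System.out.printf("after:  epoch=0x%x counter=0x%x%n",
>                 zxidAfter >> 32, zxidAfter & 0xffffffffL);   // 0x4, 0x1
>     }
> }
> {code}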
> I'm going to attach log files from all 5 ensemble nodes. I also used
> zktreeutil to dump the database on each of the 5 nodes and diff'd the
> dumps against one another. One node (id=2) has a zktreeutil dump massively
> divergent from the other 4 nodes, even though it received the DIFF from
> the new leader.
> The attachments cover all 5 nodes; I've named each log file by its
> ZooKeeper id, e.g. node4.log.