[ https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall McMullen updated ZOOKEEPER-1934:
-----------------------------------------

    Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server, we got a stale value.

Specifically, we read "/binchanges" and originally got back a value of "3" from 
the first server. After we lost the connection and reconnected before the 
session timeout, we read "/binchanges" from the new server and got back a value 
of "2". Our code path never sets this value from 3 back to 2, and we throw an 
assertion if the value ever goes backwards, which is how we caught this error.

My understanding of the single system image guarantee is that this should never 
be allowed. I realize that the guarantee is still quorum based and it's 
certainly possible for a minority of the ensemble to have stale data. However, 
I also believe that each client sends the highest zxid it has seen as part of 
its connection request to the server, and if the server it's connecting to has 
a smaller zxid than the value the client sends, then the connection request 
should be refused.
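
For reference, here is a minimal sketch of the check I believe should happen on 
(re)connect. The class and field names below are my own illustration, not the 
actual ZooKeeper server code:

{code:java}
// Hypothetical sketch: a server that has processed fewer transactions than
// the connecting client has already seen must refuse the session.
public class SessionAcceptor {

    static class ConnectRequest {
        final long sessionId;
        final long lastZxidSeen;  // highest zxid the client has observed

        ConnectRequest(long sessionId, long lastZxidSeen) {
            this.sessionId = sessionId;
            this.lastZxidSeen = lastZxidSeen;
        }
    }

    private final long lastProcessedZxid;  // this server's latest committed zxid

    public SessionAcceptor(long lastProcessedZxid) {
        this.lastProcessedZxid = lastProcessedZxid;
    }

    public void accept(ConnectRequest req) {
        if (req.lastZxidSeen > lastProcessedZxid) {
            // Refusing here is what prevents a client from reading state older
            // than what it has already seen (our /binchanges case).
            throw new IllegalStateException(String.format(
                "Refusing session 0x%x: client has seen zxid 0x%x but server "
                + "has only processed up to 0x%x",
                req.sessionId, req.lastZxidSeen, lastProcessedZxid));
        }
        // ...otherwise the session is (re)established as usual...
    }
}
{code}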

Assuming I have all of that correct, I'm at a loss as to how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at Jun  4 
08:13:30, there was a round of leader election. During that round of leader 
election we voted for the server with id=4 and zxid=0x300001c4c, which led to a 
new zxid=0x400000001. The new leader sent a diff to all the servers, including 
the one we would soon read the stale data from (id=2). Server id=2's log files 
also reflect that as of 08:13:43 it was up to date and current, having received 
an UPTODATE message.
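
For anyone cross-checking those zxids: the high 32 bits of a zxid are the 
leader epoch and the low 32 bits are the counter within that epoch, so 
0x300001c4c is a transaction in epoch 3 and 0x400000001 is the first 
transaction of epoch 4. A quick way to verify:

{code:java}
// Decode the two zxids mentioned above into (epoch, counter).
public class ZxidDecode {
    public static void main(String[] args) {
        long[] zxids = { 0x300001c4cL, 0x400000001L };
        for (long zxid : zxids) {
            long epoch = zxid >>> 32;           // leader epoch (high 32 bits)
            long counter = zxid & 0xffffffffL;  // txn counter  (low 32 bits)
            System.out.printf("zxid=0x%x -> epoch=%d, counter=0x%x%n",
                              zxid, epoch, counter);
        }
    }
}
{code}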

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database from each of the 5 ensemble nodes, then diff'd the dumps 
and compared them all for correctness. One of the nodes (id=2) has a zktreeutil 
dump that diverges massively from the other 4 nodes even though it received the 
diff from the new leader.

The attachments cover all 5 nodes. Each log file is named by its zookeeper id, 
e.g. node4.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server, we got a stale value.

Specifically, we read "/binchanges" and originally got back a value of "4" from 
the first server. After we lost the connection and reconnected before the 
session timeout, we read "/binchanges" from the new server and got back a value 
of "3". 

My understanding of the single system image guarantee is that this should never 
be allowed. I realize that the guarantee is still quorum based and it's 
certainly possible for a minority of the ensemble to have stale data. However, 
I also believe that each client sends the highest zxid it has seen as part of 
its connection request to the server, and if the server it's connecting to has 
a smaller zxid than the value the client sends, then the connection request 
should be refused.

Assuming I have all of that correct, I'm at a loss as to how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at Jun  4 
08:13:30, there was a round of leader election. During that round of leader 
election we voted for the server with id=4 and zxid=0x300001c4c, which led to a 
new zxid=0x400000001. The new leader sent a diff to all the servers, including 
the one we would soon read the stale data from (id=2). Server id=2's log files 
also reflect that as of 08:13:43 it was up to date and current, having received 
an UPTODATE message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database from each of the 5 ensemble nodes, then diff'd the dumps 
and compared them all for correctness. One of the nodes (id=2) has a zktreeutil 
dump that diverges massively from the other 4 nodes even though it received the 
diff from the new leader.

The attachments cover all 5 nodes. Each log file is named by its zookeeper id, 
e.g. node4.log.






> Stale data received from sync'd ensemble peer
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1934
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Marshall McMullen
>         Attachments: node1.log, node2.log, node3.log, node4.log, node5.log
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
