[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Han updated ZOOKEEPER-2466:
-----------------------------------
    Attachment: ZOOKEEPER-2466.patch

Hi [~fpj], a late update on this one:
I designed a test case that always makes the bug reveal itself. The reason we 
did not see the bug happen deterministically is the probabilistic nature of 
{code}get_next_server_in_reconfig{code}, which might or might not return a 
working server. The bug can be reproduced deterministically if we take the 
probability out of the equation by always making 
{code}get_next_server_in_reconfig{code} return non-zero, and this can be 
achieved if all servers are down. So the updated test case first makes sure all 
servers are down and the zk client can't connect; then it starts the server 
and verifies that the client can connect. 

I tested with and without the patched change in zookeeper.c: without the change 
the new test always fails, and with the change the new test passes my stress 
test of 300 runs.

> Client skips servers when trying to connect
> -------------------------------------------
>
>                 Key: ZOOKEEPER-2466
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2466
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>            Reporter: Flavio Junqueira
>            Assignee: Michael Han
>            Priority: Critical
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2466.patch, ZOOKEEPER-2466.patch
>
>
> I've been looking at {{Zookeeper_simpleSystem::testFirstServerDown}} and I 
> observed the following behavior. The list of servers to connect contains two 
> servers, let's call them S1 and S2. The client never connects, but the odd 
> bit is the sequence of servers that the client tries to connect to:
> {noformat}
> S1
> S2
> S1
> S1
> S1
> <keeps repeating S1>
> {noformat}
> It intrigued me that S2 is only tried once and never again. Checking the 
> code, here is what happens. Initially, {{zh->reconfig}} is 1, so in 
> {{zoo_cycle_next_server}} we return an address from 
> {{get_next_server_in_reconfig}}, which is taken from {{zh->addrs_new}} in 
> this test case. The attempt to connect fails, and {{handle_error}} is invoked 
> in the error handling path. {{handle_error}} actually invokes 
> {{addrvec_next}} which changes the address pointer to the next server on the 
> list.
> After two attempts, it decides that it has tried all servers in 
> {{zoo_cycle_next_server}} and sets {{zh->reconfig}} to zero. Once 
> {{zh->reconfig == 0}}, we have that each call to {{zoo_cycle_next_server}} 
> moves the address pointer to the next server in {{zh->addrs}}. But, given 
> that {{handle_error}} also moves the pointer to the next server, we end up 
> moving the pointer ahead twice upon every failed attempt to connect, which is 
> wrong.
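
The double advance described above can be illustrated with a small toy model. 
This is only a sketch, not the code in zookeeper.c: toy_addrvec, toy_next and 
simulate are hypothetical names, the reconfig phase is left out, and the 
"single advance" variant is just one way to picture the corrected behavior, 
not necessarily what the attached patch does.

```c
#include <stdio.h>
#include <string.h>

/* Toy model (hypothetical, not the real client structures):
 * a circular list of two server names with a cursor. */
typedef struct {
    const char *names[2];
    int count;
    int next;   /* index of the next address to hand out */
} toy_addrvec;

/* Return the address under the cursor, then advance the cursor with
 * wrap-around. Stands in for the role addrvec_next plays in the report. */
static const char *toy_next(toy_addrvec *v) {
    const char *cur = v->names[v->next];
    v->next = (v->next + 1) % v->count;
    return cur;
}

/* Simulate `attempts` failed connection attempts and write the names of the
 * servers tried into `out`, separated by spaces.
 * advance_in_error_path != 0 models the behavior in the report: the error
 * path advances the cursor a second time after every failure. */
static void simulate(int advance_in_error_path, int attempts,
                     char *out, size_t out_len) {
    toy_addrvec v = { { "S1", "S2" }, 2, 0 };
    out[0] = '\0';
    for (int i = 0; i < attempts; i++) {
        /* Picking the server to try: models zoo_cycle_next_server. */
        const char *tried = toy_next(&v);
        if (i > 0) {
            strncat(out, " ", out_len - strlen(out) - 1);
        }
        strncat(out, tried, out_len - strlen(out) - 1);
        if (advance_in_error_path) {
            /* The attempt failed: models the extra advance in handle_error. */
            (void)toy_next(&v);
        }
    }
}
```

With two servers, simulate(1, 4, buf, sizeof buf) leaves "S1 S1 S1 S1" in buf, 
because moving the cursor twice per failure wraps back to the same entry, 
while simulate(0, 4, buf, sizeof buf) leaves "S1 S2 S1 S2".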



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
