[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384597#comment-15384597
 ] 

Michael Han commented on ZOOKEEPER-2466:
----------------------------------------

[~fpj] 
bq. What happens in this case?

What happens depends on the state of the server that the client connects to 
after the reconfig:
* If the new server is a read-only server (whether it is the same read-only 
server the client was previously connected to, or a different read-only server 
it reconnects to after load balancing), then the client will continue seeking 
a RW server after the reconfig has finished. 
* If the new server is a read-write server, then we are done.

Specifically, for the case where the current RO server is removed during the 
reconfig, the error handling logic takes care of retrying the connection to 
another server, so we eventually end up in one of the two cases listed above. 
Does this make sense to you?

bq. I was thinking if we need a test case for this.
I agree. Actually, I think this patch fixes a bug where we could change the 
state of zh->addr_cur during reconfig without protection, which is a potential 
data race leading to undefined behavior. It would be good to have a test case 
that covers this. The Java client might have a similar issue (because RO was 
introduced before the reconfig feature).
For the existing test case, I'll double check whether we can extend or improve 
it into a deterministically failing case that covers this scenario. I'll work 
on adding both cases.
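
To illustrate the kind of protection meant here, below is a minimal sketch in which every read and update of a connection cursor goes through one mutex. It is not the code from the attached patch: the type {{guarded_cursor}} and its functions are hypothetical names, and the cursor is reduced to an index into a server list of known size. The only assumption taken from the discussion is that the reconnect path and the reconfig path can both modify the same cursor.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model: a cursor into a server list, shared by the
 * reconnect path and the reconfig path. Not the real zhandle_t layout. */
typedef struct {
    pthread_mutex_t lock;
    size_t count;   /* number of servers in the current list */
    size_t cur;     /* index of the server currently selected */
} guarded_cursor;

/* Initialize the cursor; returns 0 on success, like pthread_mutex_init. */
int guarded_cursor_init(guarded_cursor *c, size_t count) {
    c->count = count;
    c->cur = 0;
    return pthread_mutex_init(&c->lock, NULL);
}

/* Reconnect path: move to the next server (wrapping around) and return
 * the new index. The lock keeps count and cur consistent with each other. */
size_t guarded_cursor_advance(guarded_cursor *c) {
    pthread_mutex_lock(&c->lock);
    if (c->count > 0) {
        c->cur = (c->cur + 1) % c->count;
    }
    size_t selected = c->cur;
    pthread_mutex_unlock(&c->lock);
    return selected;
}

/* Reconfig path: install a new list size and reset the cursor while
 * holding the same lock, so an advance never sees a half-updated state. */
void guarded_cursor_reset(guarded_cursor *c, size_t new_count) {
    pthread_mutex_lock(&c->lock);
    c->count = new_count;
    c->cur = 0;
    pthread_mutex_unlock(&c->lock);
}
```

Without the lock, an advance running concurrently with a reset could compute an index against the old list size and store it after the new size is installed, which is the sort of unprotected state change described above.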




> Client skips servers when trying to connect
> -------------------------------------------
>
>                 Key: ZOOKEEPER-2466
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2466
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>            Reporter: Flavio Junqueira
>            Assignee: Michael Han
>            Priority: Critical
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2466.patch
>
>
> I've been looking at {{Zookeeper_simpleSystem::testFirstServerDown}} and I 
> observed the following behavior. The list of servers to connect contains two 
> servers, let's call them S1 and S2. The client never connects, but the odd 
> bit is the sequence of servers that the client tries to connect to:
> {noformat}
> S1
> S2
> S1
> S1
> S1
> <keeps repeating S1>
> {noformat}
> It intrigued me that S2 is only tried once and never again. Checking the 
> code, here is what happens. Initially, {{zh->reconfig}} is 1, so in 
> {{zoo_cycle_next_server}} we return an address from 
> {{get_next_server_in_reconfig}}, which is taken from {{zh->addrs_new}} in 
> this test case. The attempt to connect fails, and {{handle_error}} is invoked 
> in the error handling path. {{handle_error}} actually invokes 
> {{addrvec_next}} which changes the address pointer to the next server on the 
> list.
> After two attempts, it decides that it has tried all servers in 
> {{zoo_cycle_next_server}} and sets {{zh->reconfig}} to zero. Once 
> {{zh->reconfig == 0}}, we have that each call to {{zoo_cycle_next_server}} 
> moves the address pointer to the next server in {{zh->addrs}}. But, given 
> that {{handle_error}} also moves the pointer to the next server, we end up 
> moving the pointer ahead twice upon every failed attempt to connect, which is 
> wrong.
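
The double advance described in the quoted report can be reproduced with a toy model. The sketch below is not the client's code: {{server_list}}, {{server_list_next}} and {{count_attempts_on}} are hypothetical names, and the model only covers the phase after {{zh->reconfig}} has dropped to zero. It assumes each connection attempt takes the server at the cursor and advances it, and that the error path optionally advances the cursor a second time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the client's address list and its cursor. */
typedef struct {
    const char **names;
    size_t count;
    size_t next;    /* index of the next server to hand out */
} server_list;

/* Return the server at the cursor, then advance the cursor with wraparound. */
const char *server_list_next(server_list *l) {
    assert(l->count > 0);
    const char *server = l->names[l->next];
    l->next = (l->next + 1) % l->count;
    return server;
}

/* Simulate `attempts` failed connection attempts and count how many of
 * them targeted `target` (compared by pointer). When error_path_advances
 * is nonzero, the error handler moves the cursor again after each
 * failure, which models the double advance from the report. */
size_t count_attempts_on(server_list *l, size_t attempts,
                         int error_path_advances, const char *target) {
    size_t hits = 0;
    for (size_t i = 0; i < attempts; i++) {
        const char *tried = server_list_next(l);
        if (tried == target) {
            hits++;
        }
        if (error_path_advances) {
            (void)server_list_next(l);  /* second advance on the same failure */
        }
    }
    return hits;
}
```

With a two-server list, six failed attempts with the extra advance never reach the second server, because the cursor moves by two and wraps back to the first entry each time. With a single advance per failure, the same six attempts alternate and reach the second server three times.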



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
