[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364131#comment-15364131
 ] 

Flavio Junqueira commented on ZOOKEEPER-2466:
---------------------------------------------

[~shralex] Good catch, it is exactly the same problem. The description is about 
a list of two servers, but the general issue is that we skip one server in the 
list every time.

[~hanm] The test case isn't related to reconfiguration, that's correct. 
However, zh->reconfig is set to 1 initially according to the logic we have 
implemented. That's what I observed while tracing the execution. The fact that 
it is set to 1 initially actually changes the lists we are getting the server 
addresses from (there are _old and _new lists in the handle).

There isn't much in the output, but here is a sample:

{noformat}
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1027: Client 
environment:zookeeper.version=zookeeper C client 3.5.2
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1031: Client 
environment:host.name=fpj-test-apache-01
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1038: Client 
environment:os.name=Linux
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1039: Client 
environment:os.arch=4.4.0-28-generic
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1040: Client 
environment:os.version=#47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1048: Client 
environment:user.name=fpj
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1056: Client 
environment:user.home=/root
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1068: Client 
environment:user.dir=/home/fpj/code/zookeeper-3.5.2-alpha/src/c
2016-07-05 18:35:50,174:42240:ZOO_INFO@zookeeper_init_internal@1111: Initiating 
client connection, host=127.0.0.1:22182,127.0.0.1:22181 sessionTimeout=10000 
watcher=0x447050 sessionId=0 sessionPasswd=<null> context=0x7ffcc708fec0 flags=0
2016-07-05 18:35:51,174:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD] 
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,174:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW] 
count=2 capacity=16 next=0 hasnext=1
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1268: Using 
next from NEW=127.0.0.1:22182
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD] 
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW] 
count=2 capacity=16 next=1 hasnext=1
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1268: Using 
next from NEW=127.0.0.1:22181
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22181] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD] 
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW] 
count=2 capacity=16 next=2 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1279: Failed 
to find either new or old
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
2016-07-05 18:35:51,176:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
2016-07-05 18:35:51,176:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket 
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused 
to accept the client
<This line keeps repeating>
{noformat}

No server seems to be up for the client to connect to, and I don't understand 
why, but I've focused mostly on why the address stays the same after some 
point rather than alternating between the two addresses.

> Client skips servers when trying to connect
> -------------------------------------------
>
>                 Key: ZOOKEEPER-2466
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2466
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.5.3, 3.6.0
>
>
> I've been looking at {{Zookeeper_simpleSystem::testFirstServerDown}} and I 
> observed the following behavior. The list of servers to connect to contains two 
> servers, let's call them S1 and S2. The client never connects, but the odd 
> bit is the sequence of servers that the client tries to connect to:
> {noformat}
> S1
> S2
> S1
> S1
> S1
> <keeps repeating S1>
> {noformat}
> It intrigued me that S2 is only tried once and never again. Checking the 
> code, here is what happens. Initially, {{zh->reconfig}} is 1, so in 
> {{zoo_cycle_next_server}} we return an address from 
> {{get_next_server_in_reconfig}}, which is taken from {{zh->addrs_new}} in 
> this test case. The attempt to connect fails, and {{handle_error}} is invoked 
> in the error handling path. {{handle_error}} actually invokes 
> {{addrvec_next}} which changes the address pointer to the next server on the 
> list.
> After two attempts, it decides that it has tried all servers in 
> {{zoo_cycle_next_server}} and sets {{zh->reconfig}} to zero. Once 
> {{zh->reconfig == 0}}, we have that each call to {{zoo_cycle_next_server}} 
> moves the address pointer to the next server in {{zh->addrs}}. But, given 
> that {{handle_error}} also moves the pointer to the next server, we end up 
> moving the pointer ahead twice upon every failed attempt to connect, which is 
> wrong.



