[
https://issues.apache.org/jira/browse/ZOOKEEPER-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364131#comment-15364131
]
Flavio Junqueira commented on ZOOKEEPER-2466:
---------------------------------------------
[~shralex] Good catch, it is exactly the same problem. The description talks
about a list of two servers, but the issue is general: we skip one server of
the list on every failed attempt.
[~hanm] The test case isn't related to reconfiguration, that's correct.
However, zh->reconfig is set to 1 initially according to the logic we have
implemented. That's what I observed while tracing the execution. The fact that
it is set to 1 initially actually changes the lists we are getting the server
addresses from (there are _old and _new lists in the handle).
There isn't much in the output, but here is a sample:
{noformat}
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1027: Client
environment:zookeeper.version=zookeeper C client 3.5.2
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1031: Client
environment:host.name=fpj-test-apache-01
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1038: Client
environment:os.name=Linux
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1039: Client
environment:os.arch=4.4.0-28-generic
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1040: Client
environment:os.version=#47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1048: Client
environment:user.name=fpj
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1056: Client
environment:user.home=/root
2016-07-05 18:35:50,174:42240:ZOO_INFO@log_env@1068: Client
environment:user.dir=/home/fpj/code/zookeeper-3.5.2-alpha/src/c
2016-07-05 18:35:50,174:42240:ZOO_INFO@zookeeper_init_internal@1111: Initiating
client connection, host=127.0.0.1:22182,127.0.0.1:22181 sessionTimeout=10000
watcher=0x447050 sessionId=0 sessionPasswd=<null> context=0x7ffcc708fec0 flags=0
2016-07-05 18:35:51,174:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD]
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,174:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW]
count=2 capacity=16 next=0 hasnext=1
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1268: Using
next from NEW=127.0.0.1:22182
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD]
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW]
count=2 capacity=16 next=1 hasnext=1
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1268: Using
next from NEW=127.0.0.1:22181
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22181] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1256: [OLD]
count=0 capacity=0 next=0 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1259: [NEW]
count=2 capacity=16 next=2 hasnext=0
2016-07-05 18:35:51,175:42240:ZOO_WARN@get_next_server_in_reconfig@1279: Failed
to find either new or old
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
2016-07-05 18:35:51,175:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
2016-07-05 18:35:51,176:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
2016-07-05 18:35:51,176:42240:ZOO_ERROR@handle_socket_error_msg@2353: Socket
[127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused
to accept the client
<This line keeps repeating>
{noformat}
No server seems to be up for the client to connect to, and I don't understand
why. So far I've focused mostly on why the address stays the same after some
point rather than alternating between the two addresses.
> Client skips servers when trying to connect
> -------------------------------------------
>
> Key: ZOOKEEPER-2466
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2466
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client
> Reporter: Flavio Junqueira
> Assignee: Flavio Junqueira
> Priority: Critical
> Fix For: 3.5.3, 3.6.0
>
>
> I've been looking at {{Zookeeper_simpleSystem::testFirstServerDown}} and I
> observed the following behavior. The list of servers to connect to contains two
> servers, let's call them S1 and S2. The client never connects, but the odd
> bit is the sequence of servers that the client tries to connect to:
> {noformat}
> S1
> S2
> S1
> S1
> S1
> <keeps repeating S1>
> {noformat}
> It intrigued me that S2 is only tried once and never again. Checking the
> code, here is what happens. Initially, {{zh->reconfig}} is 1, so in
> {{zoo_cycle_next_server}} we return an address from
> {{get_next_server_in_reconfig}}, which is taken from {{zh->addrs_new}} in
> this test case. The attempt to connect fails, and {{handle_error}} is invoked
> in the error handling path. {{handle_error}} actually invokes
> {{addrvec_next}} which changes the address pointer to the next server on the
> list.
> After two attempts, it decides that it has tried all servers in
> {{zoo_cycle_next_server}} and sets {{zh->reconfig}} to zero. Once
> {{zh->reconfig == 0}}, each call to {{zoo_cycle_next_server}}
> moves the address pointer to the next server in {{zh->addrs}}. But, given
> that {{handle_error}} also moves the pointer to the next server, we end up
> moving the pointer ahead twice upon every failed attempt to connect, which is
> wrong.
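
The double advance described above can be reproduced with a small standalone model. This is only a sketch and not the code in zookeeper.c: {{addrvec_sketch}}, {{advance}} and {{simulate}} are hypothetical names, and the model covers only the steady state after {{zh->reconfig}} has dropped to zero. It assumes the cursor moves once in the error path and once more when the next server is requested, as the description says.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the handle's address list:
 * just a server count and a cursor. Not the real addrvec_t. */
typedef struct {
    int count; /* number of servers in the list */
    int next;  /* index of the server to try next */
} addrvec_sketch;

/* Move the cursor one position forward, wrapping at the end. */
static void advance(addrvec_sketch *v)
{
    assert(v->count > 0);
    v->next = (v->next + 1) % v->count;
}

/* Model `attempts` failed connection attempts. out[i] receives the index
 * of the server tried on attempt i. advances_per_failure is 2 for the
 * behavior described in the issue (error path plus cycling to the next
 * server) and 1 for the intended behavior. */
static void simulate(addrvec_sketch *v, int advances_per_failure,
                     int attempts, int *out)
{
    for (int i = 0; i < attempts; i++) {
        out[i] = v->next; /* server tried on this attempt */
        for (int j = 0; j < advances_per_failure; j++)
            advance(v);   /* cursor moves once per advancing call */
    }
}
```

With two servers and two advances per failure, the cursor always wraps back to the same index, which matches the repeating S1 in the trace. With three servers it visits indices 0, 2, 1, 0, so one server is skipped on each failed attempt. With a single advance per failure the two servers alternate as expected.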