[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
Attachment: ZOOKEEPER-1057.patch
The test is simpler and reads better when integrated into TestClient.cc.
The attached patch applies to both trunk and branch 3.4.
With this version, the test case passes for the single-threaded version, but
for the multithreaded version it hangs forever (or at least for longer than a
few minutes).
> zookeeper c-client, connection to offline server fails to successfully
> fallback to second zk host
> -------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1057
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1057
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client
> Affects Versions: 3.3.1, 3.3.2, 3.3.3
> Environment: snowdutyrise-lm ~/-> uname -a
> Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01
> PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
> also observed on:
> 2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
> Reporter: Woody Anderson
> Assignee: Michi Mutsuzaki
> Priority: Blocker
> Fix For: 3.4.6, 3.5.0
>
> Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch,
> ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
>
>
> Hello, I'm a contributor for the node.js zookeeper module:
> https://github.com/yfinkelstein/node-zookeeper
> I'm using zk 3.3.3 for the purposes of this issue, but I have validated that
> it also fails on 3.3.1 and 3.3.2.
> I'm having an issue when trying to connect while one of my zookeeper servers
> is offline.
> If the first server attempted is online, all is good.
> If the offline server is attempted first, then the client is never able to
> connect to _any_ server.
> Inside zookeeper.c a connection loss (-4) is received, the socket is closed,
> and the buffers are cleaned up. It then attempts the next server in the list
> and creates a new socket (which gets the same fd as the previously closed
> socket), but connecting fails, and it continues to fail seemingly forever.
> The nature of this "fail" is not that it gets -4 connection loss errors, but
> that zookeeper_interest sees no activity on the socket before the
> user-provided timeout expires. I don't want to have to wait 5 minutes, even
> if I could make myself.
> This is the message that follows the connection loss:
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket
> [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection
> timed out (exceeded timeout by 3ms)
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest
> returned error: -7 - operation timeout
> While investigating, I decided to comment out close(zh->fd) in handle_error
> (zookeeper.c#1153).
> Now everything works (obviously I'm leaking an fd). Connecting to the second
> host works immediately.
> This is the behavior I'm looking for, though I clearly don't want to leak the
> fd, so I'm wondering why the fd reuse is causing this issue.
> close() is not returning an error (I checked, even though the current code
> assumes success).
> I'm on OS X 10.6.7.
> I tried adding a setsockopt SO_LINGER (though I didn't want that to be a
> solution); it didn't work.
> Full debug traces are included in the issue here:
> https://github.com/yfinkelstein/node-zookeeper/issues/6