[ https://issues.apache.org/jira/browse/TRAFODION-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967010#comment-15967010 ]
ASF GitHub Bot commented on TRAFODION-2588:
-------------------------------------------

Github user mkby commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/1061#discussion_r111298734

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -53,6 +53,9 @@
     // The monitors register their znodes under the cluster znode
     #define ZCLIENT_CLUSTER_ZNODE "/cluster"

    +// zookeeper connection retry count
    +#define ZOOKEEPER_RETRY_COUNT 3
    --- End diff --

    Based on the test results, a single retry is actually enough for this case, so it is not really necessary to modify this value.

> monitor failed to start if part of zookeeper server is down
> -----------------------------------------------------------
>
>                 Key: TRAFODION-2588
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2588
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>            Reporter: Eason Zhang
>            Assignee: Eason Zhang
>
> The monitor gets the ZooKeeper node list from the environment variable
> $ZOOKEEPER_NODES. If one of the ZooKeeper servers in this node list is down,
> the monitor fails to start with a 'ZCONNECTIONLOSS' error:
> 2017-04-11 17:15:49,351, ERROR, ZOO, Node Number: 4294967295,, PIN: 21106 ,
> Process Name: zooclient,,, TID: 21106, Message ID: 101370401,
> [CZClient::MakeClusterZNodes], zoo_exists() failed with error ZCONNECTIONLOSS
> This happens because the ZooKeeper C client randomly picks a ZooKeeper server
> to connect to, based on its own node selection algorithm, and it may pick the
> server that is down. The simple solution for the monitor is to retry when the
> connection fails, which lets the ZooKeeper C client pick another server from
> the list.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
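
For readers following the thread, the retry approach described in the issue can be sketched roughly as below. This is only an illustration, not the code from the pull request: `RetryOnConnectionLoss` is a hypothetical helper name, and `ZOK` and `ZCONNECTIONLOSS` are redefined locally (with the values used in the ZooKeeper C client's `zookeeper.h`) so the sketch compiles without the library. Only `ZOOKEEPER_RETRY_COUNT` comes from the diff above.

```cpp
#include <functional>

// Local stand-ins for the ZooKeeper C client error codes (assumption:
// same values as in zookeeper.h), so the sketch builds without the library.
const int ZOK = 0;
const int ZCONNECTIONLOSS = -4;

// Retry count taken from the diff under review.
#define ZOOKEEPER_RETRY_COUNT 3

// Hypothetical helper: run `op` once, then retry while it keeps returning
// ZCONNECTIONLOSS, up to `max_retries` additional attempts. Any other
// return code (success or a different error) is returned immediately.
int RetryOnConnectionLoss(const std::function<int()> &op,
                          int max_retries = ZOOKEEPER_RETRY_COUNT)
{
    int rc = op();
    for (int retries = 0;
         rc == ZCONNECTIONLOSS && retries < max_retries;
         ++retries)
    {
        // Each new attempt gives the client a chance to reach
        // a different server from the configured node list.
        rc = op();
    }
    return rc;
}
```

In the monitor, `op` would presumably wrap the `zoo_exists()` call that fails in `CZClient::MakeClusterZNodes`. Passing `max_retries = 1` corresponds to the single retry that the review comment reports as sufficient in testing.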