[ https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388986#comment-16388986 ]
ASF GitHub Bot commented on TRAFODION-2940:
-------------------------------------------

Github user kevinxu021 commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1427#discussion_r172729543

    --- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
    @@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
             trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
             jvmShutdownHook = new JVMShutdownHook();
             Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
    -        thrd = new Thread(this);
    -        thrd.start();
    +
    +        ExecutorService executorService = Executors.newFixedThreadPool(1);
    +        CompletionService<Integer> completionService = new ExecutorCompletionService<Integer>(executorService);
    +
    +        while (true) {
    +            completionService.submit(this);
    +            Future<Integer> f = null;
    +            try {
    +                f = completionService.take();
    +                if (f != null) {
    +                    Integer status = f.get();
    +                    if (status <= 0) {
    +                        System.exit(status);
    +                    } else {
    +                        // 35000 * 15mins ~= 1 years
    +                        RetryCounter retryCounter = RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
    +                        while (true) {
    +                            try {
    +                                ZkClient tmpZkc = new ZkClient();
    +                                tmpZkc.connect();
    +                                tmpZkc.close();
    +                                tmpZkc = null;
    +                                LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
    +                                // reset lock
    +                                isLeader = new CountDownLatch(1);
    +                                break;
    --- End diff --

    As we discussed, a lost ZooKeeper connection is already covered by the session-expired event, so this loop is unnecessary.


> In HA env, one node lose network, when recover, trafci can't use
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2940
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2940
>             Project: Apache Trafodion
>          Issue Type: Bug
>    Affects Versions: any
>            Reporter: mashengchen
>            Assignee: mashengchen
>            Priority: Major
>             Fix For: 2.3
>
>
> In an HA environment, if one node loses its network connection for a long
> time, then once the network recovers there will be two floating IPs and two
> working DCS masters, and trafci cannot be used.
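
For readers following the diff discussion, here is a minimal sketch of the submit/take/restart control flow that the patch introduces, with the inner ZooKeeper probe loop left out as the review comment suggests. It uses only java.util.concurrent. The class name RestartLoopSketch, the helper runWithRestart, the maxRestarts bound, and the simulated task are all hypothetical and are not part of DcsMaster; the real code calls System.exit(status) and loops without a bound.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class RestartLoopSketch {

    // Hypothetical helper (not in DcsMaster): runs the task on a single worker
    // thread and resubmits it whenever it returns a positive status.
    // Returns the first status <= 0, or the last positive status once
    // maxRestarts resubmissions have been used up.
    static int runWithRestart(Callable<Integer> task, int maxRestarts)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        CompletionService<Integer> completion =
                new ExecutorCompletionService<Integer>(executor);
        try {
            int restarts = 0;
            while (true) {
                completion.submit(task);
                // take() blocks until the submitted task has finished.
                int status = completion.take().get();
                if (status <= 0 || restarts >= maxRestarts) {
                    return status; // the patch calls System.exit(status) here
                }
                restarts++; // positive status: run the task again
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger calls = new AtomicInteger();
        // Simulated master: asks for a restart twice, then exits cleanly.
        Callable<Integer> fakeMaster = new Callable<Integer>() {
            public Integer call() {
                return calls.incrementAndGet() < 3 ? 1 : 0;
            }
        };
        int status = runWithRestart(fakeMaster, 10);
        System.out.println("final status=" + status + ", runs=" + calls.get());
    }
}
```

The sketch assumes that a positive return value means "restart" and a non-positive one means "exit", which is how the quoted hunk treats status. Whether a restart needs an explicit connectivity check first is exactly the point under review.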
-- This message was sent by Atlassian JIRA (v7.6.3#76005)