Github user mashengchen commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1427#discussion_r166504107
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
jvmShutdownHook = new JVMShutdownHook();
Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
- thrd = new Thread(this);
- thrd.start();
+
+ ExecutorService executorService = Executors.newFixedThreadPool(1);
+ CompletionService<Integer> completionService = new ExecutorCompletionService<Integer>(executorService);
+
+ while (true) {
+ completionService.submit(this);
+ Future<Integer> f = null;
+ try {
+ f = completionService.take();
+ if (f != null) {
+ Integer status = f.get();
+ if (status <= 0) {
+ System.exit(status);
+ } else {
+ // 35000 * 15mins ~= 1 years
+ RetryCounter retryCounter = RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
+ while (true) {
+ try {
+ ZkClient tmpZkc = new ZkClient();
+ tmpZkc.connect();
+ tmpZkc.close();
+ tmpZkc = null;
+ LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
+ // reset lock
+ isLeader = new CountDownLatch(1);
+ break;
--- End diff --
This logic handles the case where DcsMaster returns because of a network error.
It first tries to connect to ZooKeeper.
If it cannot connect (tmpZkc.connect() fails), control goes to the catch block
and the connection is retried.
If it does connect, DcsMaster runs the call() method again. By the time
DcsMaster restarts, a backup master must already be working: one DCS master
must always be active, so when the current master lost its network the backup
master took over the role. So when DcsMaster restarts, it sets its value in
ZooKeeper under /rootpath/dcs/leader/ and then blocks on the lock
"isLeader = new CountDownLatch(1);".
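
For illustration, the flow described above can be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in the PR: ZkProbe, superviseMaster, waitForZooKeeper, maxRetries and sleepMillis are hypothetical names standing in for ZkClient, the loop in the DcsMaster constructor and RetryCounter. It returns the status instead of calling System.exit so it can be exercised in isolation, and what happens once the retries run out is an assumption, since the quoted diff is cut off before that point.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MasterRestartSketch {

    /** Hypothetical stand-in for the ZkClient connect()/close() probe in the diff. */
    public interface ZkProbe {
        void connectAndClose() throws Exception;
    }

    /**
     * Runs the master task; while it returns a positive status (treated here as
     * "network error"), waits for ZooKeeper to become reachable and runs it again.
     * Returns the final status instead of calling System.exit.
     */
    public static int superviseMaster(Callable<Integer> master, ZkProbe probe,
                                      int maxRetries, long sleepMillis) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        CompletionService<Integer> completion = new ExecutorCompletionService<>(executor);
        try {
            while (true) {
                completion.submit(master);
                int status = completion.take().get();
                if (status <= 0) {
                    return status; // the diff calls System.exit(status) here
                }
                if (!waitForZooKeeper(probe, maxRetries, sleepMillis)) {
                    return status; // assumption: give up once retries are exhausted
                }
                // The diff resets the leader latch at this point
                // (isLeader = new CountDownLatch(1)) before the task is resubmitted.
            }
        } finally {
            executor.shutdownNow();
        }
    }

    /** Probes ZooKeeper until it answers or maxRetries attempts have failed. */
    static boolean waitForZooKeeper(ZkProbe probe, int maxRetries, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                probe.connectAndClose();
                return true;
            } catch (Exception e) {
                Thread.sleep(sleepMillis); // the catch-and-retry branch
            }
        }
        return false;
    }
}
```

With maxRetries = 35000 and a 15-minute sleep, the retry window is 35000 * 15 min = 525000 min, about 364.6 days, which matches the "~= 1 year" note in the diff.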
---