Github user jinossy commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/566#discussion_r30236952
  
    --- Diff: tajo-core/src/main/java/org/apache/tajo/worker/TajoWorker.java ---
    @@ -322,9 +323,33 @@ public void serviceStart() throws Exception {
         startJvmPauseMonitor();
     
         tajoMasterInfo = new TajoMasterInfo();
    +
    +    // Most of the time, TajoWorker will start before TajoMaster. After TajoMaster non-HA, it doesn't matter.
    +    // But in HA, TajoWorker will fail to start because TajoWorker couldn't find TajoMaster address on shared storage.
    +    // Thus, TajoWorker need to try to find the address for a certain period of time.
         if (systemConf.getBoolVar(TajoConf.ConfVars.TAJO_MASTER_HA_ENABLE)) {
    -      tajoMasterInfo.setTajoMasterAddress(serviceTracker.getUmbilicalAddress());
    -      tajoMasterInfo.setWorkerResourceTrackerAddr(serviceTracker.getResourceTrackerAddress());
    +      long retryWaitTime = systemConf.getLongVar(TajoConf.ConfVars.TAJO_MASTER_HA_CLIENT_RETRY_WAIT_TIME);
    +      int retryMaxNum = systemConf.getIntVar(ConfVars.TAJO_MASTER_HA_CLIENT_RETRY_MAX_NUM);
    +      int retryNum = 1;
    +
    +      boolean done = false;
    +
    +      while (!done && retryNum < retryMaxNum) {
    +        try {
    +          tajoMasterInfo.setTajoMasterAddress(serviceTracker.getUmbilicalAddress());
    +          tajoMasterInfo.setWorkerResourceTrackerAddr(serviceTracker.getResourceTrackerAddress());
    +          done = true;
    +          LOG.info("Find a new TajoMaster (" + tajoMasterInfo.getTajoMasterAddress() + ")");
    +        } catch (ServiceTrackerException e) {
    +          LOG.warn("Retry TajoMaster address (" + retryNum + ")");
    +          Thread.sleep(retryWaitTime);
    +        }
    +        retryNum++;
    +        if (retryNum == retryMaxNum) {
    +          LOG.error("ERROR: the maximum retry (" + retryNum + ") to read TajoMaster address");
    +          break;
    +        }
    --- End diff --
    
    Can you move the retry code to HdfsServiceTracker, since the backup master is registered in PingChecker?
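
    One way the retry could be factored out of TajoWorker is sketched below. This is only an illustration under assumptions: `RetryingLookup`, `AddressSupplier`, and `LookupException` are hypothetical stand-ins (for a helper inside the tracker, a call such as `getUmbilicalAddress()`, and `ServiceTrackerException` respectively), not existing Tajo classes. The sketch makes exactly `maxAttempts` attempts, whereas the loop in the diff starts `retryNum` at 1 and tests `retryNum < retryMaxNum`, so it makes one attempt fewer and can also log the "maximum retry" error right after a successful final attempt.

```java
public class RetryingLookup {

  /** Hypothetical stand-in for Tajo's ServiceTrackerException. */
  public static class LookupException extends Exception {
    public LookupException(String message) {
      super(message);
    }
  }

  /** Hypothetical callback wrapping one address lookup attempt. */
  public interface AddressSupplier<T> {
    T get() throws LookupException;
  }

  /**
   * Calls the supplier until it succeeds or maxAttempts attempts have failed.
   * Sleeps waitMillis between failed attempts, but not after the last one.
   * Rethrows the last failure so the caller decides how to report it.
   */
  public static <T> T lookupWithRetry(AddressSupplier<T> supplier, int maxAttempts, long waitMillis)
      throws LookupException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    LookupException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return supplier.get();
      } catch (LookupException e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(waitMillis);
        }
      }
    }
    throw last;
  }
}
```

    With a helper like this inside the tracker, the worker's `serviceStart()` could go back to two plain calls, and any other caller of the tracker would get the same retry behavior.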


