[ https://issues.apache.org/jira/browse/TAJO-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542023#comment-14542023 ]

ASF GitHub Bot commented on TAJO-1586:
--------------------------------------

Github user jinossy commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/566#discussion_r30236952
  
    --- Diff: tajo-core/src/main/java/org/apache/tajo/worker/TajoWorker.java ---
    @@ -322,9 +323,33 @@ public void serviceStart() throws Exception {
         startJvmPauseMonitor();
     
         tajoMasterInfo = new TajoMasterInfo();
    +
    +    // Most of the time, TajoWorker will start before TajoMaster. After TajoMaster non-HA, it doesn't matter.
    +    // But in HA, TajoWorker will fail to start because TajoWorker couldn't find TajoMaster address on shared storage.
    +    // Thus, TajoWorker need to try to find the address for a certain period of time.
         if (systemConf.getBoolVar(TajoConf.ConfVars.TAJO_MASTER_HA_ENABLE)) {
    -      tajoMasterInfo.setTajoMasterAddress(serviceTracker.getUmbilicalAddress());
    -      tajoMasterInfo.setWorkerResourceTrackerAddr(serviceTracker.getResourceTrackerAddress());
    +      long retryWaitTime = systemConf.getLongVar(TajoConf.ConfVars.TAJO_MASTER_HA_CLIENT_RETRY_WAIT_TIME);
    +      int retryMaxNum = systemConf.getIntVar(ConfVars.TAJO_MASTER_HA_CLIENT_RETRY_MAX_NUM);
    +      int retryNum = 1;
    +
    +      boolean done = false;
    +
    +      while (!done && retryNum < retryMaxNum) {
    +        try {
    +          tajoMasterInfo.setTajoMasterAddress(serviceTracker.getUmbilicalAddress());
    +          tajoMasterInfo.setWorkerResourceTrackerAddr(serviceTracker.getResourceTrackerAddress());
    +          done = true;
    +          LOG.info("Find a new TajoMaster (" + tajoMasterInfo.getTajoMasterAddress() + ")");
    +        } catch (ServiceTrackerException e) {
    +          LOG.warn("Retry TajoMaster address (" + retryNum + ")");
    +          Thread.sleep(retryWaitTime);
    +        }
    +        retryNum++;
    +        if (retryNum == retryMaxNum) {
    +          LOG.error("ERROR: the maximum retry (" + retryNum + ") to read TajoMaster address");
    +          break;
    +        }
    --- End diff --
    
    Can you move the retry code to HdfsServiceTracker, since the backup master 
is registered in PingChecker?
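
One way to read this request is to keep the bounded retry next to the lookup itself rather than in TajoWorker.serviceStart(). The sketch below is only an illustration under assumptions: `BoundedRetry` and its `call` method are hypothetical names, not Tajo APIs, and the lambda in `main` is a stand-in for the shared-storage lookup with a placeholder address. The loop in the quoted hunk starts at retryNum = 1 and tests retryNum < retryMaxNum, so it makes at most retryMaxNum - 1 attempts and sleeps once more after the last failure. This version makes exactly maxAttempts attempts, skips the final sleep, and rethrows the last failure so the caller can decide how to fail.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Tajo. It only illustrates a bounded retry
// that could sit next to the address lookup instead of inside TajoWorker.
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Calls lookup up to maxAttempts times, sleeping waitMillis between
   * failed attempts. Rethrows the last failure if every attempt fails.
   */
  static <T> T call(Callable<T> lookup, int maxAttempts, long waitMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return lookup.call();
      } catch (InterruptedException e) {
        throw e; // do not keep retrying once the thread is interrupted
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(waitMillis); // wait only when another attempt follows
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the shared-storage lookup: fails twice, then succeeds.
    int[] calls = {0};
    String address = call(() -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("No active master entry");
      }
      return "master.example:1234"; // placeholder address
    }, 5, 10L);
    System.out.println(address + " after " + calls[0] + " attempts");
  }
}
```

In the worker, the two setter calls from the hunk could then wrap their lookups in one such call each, with the wait time and attempt limit read from the same configuration variables the patch already uses.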


> TajoMaster HA startup failure on Yarn.
> --------------------------------------
>
>                 Key: TAJO-1586
>                 URL: https://issues.apache.org/jira/browse/TAJO-1586
>             Project: Tajo
>          Issue Type: Bug
>          Components: tajo master
>    Affects Versions: 0.10.0
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.11.0, 0.10.1
>
>         Attachments: TAJO-1586.patch
>
>
> I tried to deploy Tajo on YARN with Slider, but the deployment failed 
> because of a TajoMaster HA failure: TajoWorker failed to load the TajoMaster 
> address, as follows.
> {code:xml}
> 2015-04-28 04:52:22,266 INFO org.apache.hadoop.service.AbstractService: Service org.apache.tajo.worker.TajoWorker failed in state STARTED; cause: org.apache.tajo.service.ServiceTrackerException: org.apache.tajo.service.ServiceTrackerException: No active master entry
> org.apache.tajo.service.ServiceTrackerException: org.apache.tajo.service.ServiceTrackerException: No active master entry
>       at org.apache.tajo.ha.HdfsServiceTracker.getAddressElements(HdfsServiceTracker.java:441)
>       at org.apache.tajo.ha.HdfsServiceTracker.getUmbilicalAddress(HdfsServiceTracker.java:348)
>       at org.apache.tajo.worker.TajoWorker.serviceStart(TajoWorker.java:318)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tajo.worker.TajoWorker.startWorker(TajoWorker.java:141)
>       at org.apache.tajo.worker.TajoWorker.main(TajoWorker.java:627)
> Caused by: org.apache.tajo.service.ServiceTrackerException: No active master entry
>       at org.apache.tajo.ha.HdfsServiceTracker.getAddressElements(HdfsServiceTracker.java:413)
>       ... 5 more
> 2015-04-28 04:52:22,307 INFO org.apache.hadoop.service.AbstractService: Service WorkerHeartbeatService failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at org.apache.tajo.worker.WorkerHeartbeatService$WorkerHeartbeatThread.access$000(WorkerHeartbeatService.java:101)
>       at org.apache.tajo.worker.WorkerHeartbeatService.serviceStop(WorkerHeartbeatService.java:90)
>       at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>       at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>       at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>       at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
>       at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
>       at org.apache.tajo.worker.TajoWorker.serviceStop(TajoWorker.java:375)
>       at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>       at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>       at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>       at org.apache.tajo.worker.TajoWorker.startWorker(TajoWorker.java:141)
>       at org.apache.tajo.worker.TajoWorker.main(TajoWorker.java:627){code}
> I think the cause of this failure is the difference in startup time between 
> TajoMaster and TajoWorker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)