wangxiaojing123 commented on a change in pull request #664: KYLIN-4017 Build 
engine get zk(zookeeper) lock failed when building job, it causes the whole 
build engine doesn't work
URL: https://github.com/apache/kylin/pull/664#discussion_r290618823
 
 

 ##########
 File path: core-common/src/main/java/org/apache/kylin/common/util/ZKUtil.java
 ##########
 @@ -84,7 +84,7 @@ public void onRemoval(RemovalNotification<String, CuratorFramework> notification
                         logger.error("Error at closing " + curator, ex);
                     }
                 }
-            }).expireAfterWrite(1, TimeUnit.DAYS).build();
+            }).expireAfterWrite(10000, TimeUnit.DAYS).build();//never expired
 
 Review comment:
   > If the cache entry expires after 1 day, the removal listener runs `curator.close()`; in other words, the client created by `newZookeeperClient` is closed. But that client should stay in the started state for the whole lifecycle of the build engine, because it is used when building segments. If the client's state is not started, it can't get the zk lock, so the job can't be built:
   
   DistributedScheduler:
   
   ```java
   public void run() {
       try (SetThreadName ignored = new SetThreadName("Scheduler %s Job %s",
               System.identityHashCode(DistributedScheduler.this), executable.getId())) {
           if (jobLock.lock(getLockPath(executable.getId()))) {
               logger.info(executable.toString() + " scheduled in server: " + serverName);

               context.addRunningJob(executable);
               jobWithLocks.add(executable.getId());
               executable.execute(context);
           }
       } catch (ExecuteException e) {
           logger.error("ExecuteException job:" + executable.getId() + " in server: " + serverName, e);
       } catch (Exception e) {
           logger.error("unknown error execute job:" + executable.getId() + " in server: " + serverName, e);
       } finally {
           context.removeRunningJob(executable);
           releaseJobLock(executable);
           // trigger the next step asap
           fetcherPool.schedule(fetcher, 0, TimeUnit.SECONDS);
       }
   }
   ```
   
   
   ZookeeperDistributedLock:
   ```java
   public boolean lock(String lockPath) {
       logger.debug("{} trying to lock {}", client, lockPath);
       try {
           curator.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL).forPath(lockPath, clientBytes);
       } catch (KeeperException.NodeExistsException ex) {
           logger.debug("{} see {} is already locked", client, lockPath);
       } catch (Exception ex) {
           throw new IllegalStateException("Error while " + client + " trying to lock " + lockPath, ex);
       }

       String lockOwner = peekLock(lockPath);
       if (client.equals(lockOwner)) {
           logger.info("{} acquired lock at {}", client, lockPath);
           return true;
       } else {
           logger.debug("{} failed to acquire lock at {}, which is held by {}", client, lockPath, lockOwner);
           return false;
       }
   }
   ```
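
The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch using only the JDK, not Kylin, Guava, or Curator code. `FakeClient`, `ExpiringCache`, `canLockAfter`, the cache key, and the lock path are all hypothetical names introduced for illustration. The sketch assumes the lock holder keeps its own long-lived reference to the client while the cache closes that same client on expiry. It also uses a manual clock, whereas a real cache may evict lazily during later accesses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class ExpiringClientCacheSketch {

    /** Hypothetical stand-in for a ZooKeeper client; this is not the Curator API. */
    static class FakeClient implements AutoCloseable {
        private boolean started = true;

        /** Simulates creating an ephemeral lock node; fails once the client is closed. */
        boolean lock(String path) {
            if (!started) {
                throw new IllegalStateException("client is closed, cannot lock " + path);
            }
            return true;
        }

        @Override
        public void close() {
            started = false;
        }
    }

    /** Tiny expire-after-write cache driven by a manual clock (assumption: eager eviction). */
    static class ExpiringCache<K, V> {
        private final long ttlMillis;
        private final Consumer<V> onRemoval;
        private final Map<K, V> values = new HashMap<>();
        private final Map<K, Long> writeTimes = new HashMap<>();
        private long now = 0;

        ExpiringCache(long ttlMillis, Consumer<V> onRemoval) {
            this.ttlMillis = ttlMillis;
            this.onRemoval = onRemoval;
        }

        void put(K key, V value) {
            values.put(key, value);
            writeTimes.put(key, now);
        }

        /** Advances the simulated clock and runs the removal callback on expired entries. */
        void advance(long millis) {
            now += millis;
            writeTimes.entrySet().removeIf(e -> {
                if (now - e.getValue() >= ttlMillis) {
                    onRemoval.accept(values.remove(e.getKey()));
                    return true;
                }
                return false;
            });
        }
    }

    /** Returns true if a lock attempt still succeeds after elapsedMillis have passed. */
    static boolean canLockAfter(long ttlMillis, long elapsedMillis) {
        // The removal callback closes the client, mirroring the onRemoval hook in the diff.
        ExpiringCache<String, FakeClient> cache = new ExpiringCache<>(ttlMillis, FakeClient::close);
        FakeClient client = new FakeClient();
        cache.put("zk-connect-string", client);

        // The lock holder keeps its own reference for the whole engine lifecycle.
        FakeClient heldByLock = client;
        cache.advance(elapsedMillis);
        try {
            return heldByLock.lock("/hypothetical/lock/job-1");
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60 * 1000;
        System.out.println(canLockAfter(oneDay, oneDay - 1));        // true: not yet expired
        System.out.println(canLockAfter(oneDay, oneDay));            // false: client was closed
        System.out.println(canLockAfter(10000 * oneDay, 2 * oneDay)); // true: far from expiry
    }
}
```

With a one-day expiry the held client becomes unusable once the entry is evicted. A very long expiry, as in the proposed change, postpones the problem rather than removing the shared-ownership issue.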
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
