[jira] [Commented] (KYLIN-1819) Exception swallowed when start DefaultScheduler fail

Wang Ken (JIRA) Tue, 05 Jul 2016 04:43:59 -0700

    [ 
https://issues.apache.org/jira/browse/KYLIN-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362382#comment-15362382
 ]


Wang Ken commented on KYLIN-1819:
---------------------------------

I have a different opinion here. A distributed system should be designed to 
recover automatically from different kinds of failure without much operational 
effort. Many distributed systems have a controller/coordinator role, and some 
implementations leverage zookeeper for leader election. During first-time 
initialization, if they encounter a zk connection issue, they usually fail 
fast. But at runtime, if they encounter zk connection loss, they just retry 
until the zk connection comes back.

Back to Kylin's Job engine: during first-time initialization, if it sees a zk 
connection issue, it should fail fast. But if it can't get the Job lock, it 
should wait until the competitor releases the lock.
And if it sees zk connection loss during runtime, it should give up the lock, 
shut down the scheduling thread pools and wait for the connection to come back. 
The Curator Client framework will retry the underlying zk client. If it detects 
that zk has reconnected, it just rejoins the lock competition. Once it gets the 
lock again, it restarts the scheduling thread pools and resumes the scheduling 
work.
The HA implementation is not as complicated as we think. Actually, the Curator 
Client framework has already implemented some recipes for leader election and 
other distributed coordination tasks, so we don't need to implement it on our 
own with the low-level Mutex lock.
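
To make the runtime behaviour described above concrete, here is a minimal 
sketch of the state transitions in plain Java. It is only an illustration: the 
class JobEngineHaSketch, its State enum and the three callback names are 
hypothetical and do not exist in Kylin or Curator. In a real setup the 
callbacks would presumably be driven by Curator's connection state listener 
and one of its leader election recipes rather than called by hand.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; none of these names come from Kylin or Curator. */
public class JobEngineHaSketch {
    public enum State { WAITING_FOR_LOCK, SCHEDULING, DISCONNECTED }

    private State state = State.WAITING_FOR_LOCK;
    private ExecutorService pool; // non-null only while SCHEDULING

    /** Assumed to be called when this node wins the job lock. */
    public synchronized void onLockAcquired() {
        if (state != State.WAITING_FOR_LOCK) {
            return; // ignore a lock signal while disconnected or already scheduling
        }
        pool = Executors.newFixedThreadPool(2); // pool size is arbitrary here
        state = State.SCHEDULING;
    }

    /** Assumed to be called on zk connection loss: give up the lock, stop scheduling. */
    public synchronized void onConnectionLost() {
        if (pool != null) {
            pool.shutdownNow();
            pool = null;
        }
        state = State.DISCONNECTED;
    }

    /** Assumed to be called once the zk client has reconnected: rejoin the competition. */
    public synchronized void onReconnected() {
        if (state == State.DISCONNECTED) {
            state = State.WAITING_FOR_LOCK;
        }
    }

    public synchronized State state() {
        return state;
    }

    public synchronized boolean isScheduling() {
        return pool != null && !pool.isShutdown();
    }
}
```

The point of the sketch is that scheduling only runs while the lock is held, 
and a lost connection always drops the node back into the lock competition 
instead of killing the server.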



> Exception swallowed when start DefaultScheduler fail
> ----------------------------------------------------
>
>                 Key: KYLIN-1819
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1819
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v1.5.1, v1.5.2
>            Reporter: Ma Gang
>            Assignee: Ma Gang
>         Attachments: fix_swallow_scheduler_start_exception.patch
>
>
> Starting the job scheduler requires acquiring the job lock from zookeeper. 
> When lock acquisition fails, it throws an IllegalException, but because the 
> scheduler is started in a new thread, the exception thrown by that thread is 
> ignored, the server still starts successfully, and no exceptions are logged. 
> That makes troubleshooting hard; the server should fail to start when the 
> scheduler fails to start.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
