[
https://issues.apache.org/jira/browse/KYLIN-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362382#comment-15362382
]
Wang Ken commented on KYLIN-1819:
---------------------------------
I have a different opinion here. A distributed system should be designed to
recover automatically from different kinds of failure without much operational
effort. Many distributed systems have a controller/coordinator role, and some
implementations leverage zookeeper for leader election. During first-time
initialization, if they encounter a zk connection issue, they usually fail
fast. But at runtime, if they encounter a zk connection loss, they just retry
until the zk connection comes back.
Back to Kylin's job engine: during first-time initialization, if it sees a zk
connection issue, it should fail fast. But if it can't get the job lock, it
should wait until the competitor releases the lock.
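To illustrate the fail-fast part, here is a minimal sketch in plain Java. It
is not Kylin's DefaultScheduler code; the class name SchedulerStartupSketch,
the method startOrFail and the simulated failure are all made up for
illustration. The idea is that the init step can still run on its own thread,
but the caller waits on a Future so that a startup exception reaches the
caller instead of being lost with the worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; this is not Kylin's DefaultScheduler code.
class SchedulerStartupSketch {

    /**
     * Runs the given init step on a separate thread and waits for it.
     * An exception thrown by the step is rethrown to the caller instead of
     * disappearing together with the worker thread.
     */
    static void startOrFail(Callable<Void> initStep) throws Exception {
        ExecutorService starter = Executors.newSingleThreadExecutor();
        try {
            Future<Void> result = starter.submit(initStep);
            result.get(); // blocks; a failure arrives as ExecutionException
        } catch (ExecutionException e) {
            throw new IllegalStateException("scheduler failed to start", e.getCause());
        } finally {
            starter.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for "cannot reach zk during first-time initialization".
        Callable<Void> brokenInit = () -> {
            throw new java.io.IOException("simulated zk connection failure");
        };
        try {
            startOrFail(brokenInit);
        } catch (IllegalStateException e) {
            System.out.println("startup aborted: " + e.getCause().getMessage());
        }
    }
}
```

In a real server the caller would presumably let the exception abort startup
rather than catch it as main does here.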
And if it sees a zk connection loss at runtime, it should give up the lock,
shut down the scheduling thread pools, and wait for the connection to come
back. The Curator client framework will retry the underlying zk client. Once it
detects that zk has reconnected, the job engine just rejoins the lock
competition. Once it gets the lock again, it restarts the scheduling thread
pools and resumes the scheduling work.
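The runtime behaviour can be modelled as a small state machine. The sketch
below is plain Java with made-up names (JobEngineLifecycleSketch, the three
states, the on... callbacks); it does not talk to zookeeper at all. In
practice the callbacks would presumably be driven by Curator's connection
state notifications and by whichever lock or leader recipe is used, which is
an assumption on my side and not something taken from the patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of the runtime behaviour; the callbacks would be driven
// by whatever connection/lock notifications the zk client library provides.
class JobEngineLifecycleSketch {
    enum State { WAITING_FOR_CONNECTION, COMPETING_FOR_LOCK, SCHEDULING }

    private State state = State.COMPETING_FOR_LOCK;
    private ExecutorService pool; // non-null only while SCHEDULING

    synchronized State state() {
        return state;
    }

    /** zk connection lost: give up the lock and stop scheduling. */
    synchronized void onConnectionLost() {
        stopPool();
        state = State.WAITING_FOR_CONNECTION;
    }

    /** zk connection is back: rejoin the lock competition. */
    synchronized void onReconnected() {
        if (state == State.WAITING_FOR_CONNECTION) {
            state = State.COMPETING_FOR_LOCK;
        }
    }

    /** Job lock acquired: restart the pool and resume scheduling. */
    synchronized void onLockAcquired() {
        if (state == State.COMPETING_FOR_LOCK) {
            pool = Executors.newFixedThreadPool(2); // pool size is arbitrary here
            state = State.SCHEDULING;
        }
    }

    /** Stops the pool when the engine itself is shut down. */
    synchronized void shutdown() {
        stopPool();
    }

    private void stopPool() {
        if (pool != null) {
            pool.shutdownNow();
            pool = null;
        }
    }

    public static void main(String[] args) {
        JobEngineLifecycleSketch engine = new JobEngineLifecycleSketch();
        engine.onLockAcquired();   // normal operation
        engine.onConnectionLost(); // zk goes away: stop scheduling
        engine.onReconnected();    // zk is back: compete again
        engine.onLockAcquired();   // lock regained: schedule again
        System.out.println(engine.state()); // prints SCHEDULING
        engine.shutdown();
    }
}
```

Note that a lock notification arriving while the connection is still lost is
ignored, so the engine never schedules without a live zk session.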
The HA implementation is not as complicated as we think. Actually, the Curator
client framework already implements recipes for leader election and other
distributed coordination tasks, so we don't need to implement them on our own
with the low-level mutex lock.
> Exception swallowed when start DefaultScheduler fail
> ----------------------------------------------------
>
> Key: KYLIN-1819
> URL: https://issues.apache.org/jira/browse/KYLIN-1819
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v1.5.1, v1.5.2
> Reporter: Ma Gang
> Assignee: Ma Gang
> Attachments: fix_swallow_scheduler_start_exception.patch
>
>
> Starting the job scheduler requires acquiring the job lock from zookeeper.
> When lock acquisition fails, it throws an IllegalException, but because the
> scheduler is started in a new thread, the exception thrown by that thread is
> ignored: the server still starts successfully and no exception is logged.
> That makes troubleshooting hard. The server should fail to start when the
> scheduler fails to start.