[
https://issues.apache.org/jira/browse/RATIS-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiacheng Liu reassigned RATIS-1709:
-----------------------------------
Assignee: Jiacheng Liu
> Add UncaughtExceptionHandler to RaftServer and Daemon threads
> -------------------------------------------------------------
>
> Key: RATIS-1709
> URL: https://issues.apache.org/jira/browse/RATIS-1709
> Project: Ratis
> Issue Type: Improvement
> Reporter: Jiacheng Liu
> Assignee: Jiacheng Liu
> Priority: Major
>
> In Ratis many threads are created using `Daemon` class manually. For threads
> like this, if there's an uncaught exception, the thread will just crash
> silently without other components knowing. If the thread happens to be a
> critical component then some part of the RaftServer is essentially down,
> whereas the RaftServer's lifecycle is still RUNNING (not set to EXCEPTION
> because the thread didn't have a chance).
> One example where this can happen is
> [https://github.com/apache/ratis/pull/417/files] Before this change is in,
> the StateMachineUpdater thread can throw NPE and exit, so the follower
> RaftServer stays stale forever. The RaftServer's lifecycle is RUNNING and
> there's no way for the external party to know by
> `RaftServer.getLifeCycleState()`.
> The proposal is to improve observability on RaftServer to ensure an uncaught
> exception can be caught and propagated to the external user, by multiple
> folds:
> # For all `Daemon` threads, they should have UncaughtExceptionHandler set.
> # The UncaughtExceptionHandler is defined by the application by
> RaftServer.Builder when creating the RaftServer. Then the RaftServer
> propagates the handler to each Daemon thread on creating them.
> So external users canĀ
> {code:java}
> AtomicBoolean raftCrashed = new AtomicBoolean(false);
> AtomicReference<Throwable> raftError = new AtomicReference<>(null);
> RaftServer server = RaftServer.newBuilder()
> .setUncaughtExceptionHandler((thread, ex) -> {
> raftCrashed.set(true);
> raftError.set(ex);
> }).build();
> // Periodically check
> if (raftCrashed) {
> LOG.error("RaftServer crashed", raftError.get());
> }{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)