[ 
https://issues.apache.org/jira/browse/RATIS-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiacheng Liu reassigned RATIS-1709:
-----------------------------------

    Assignee: Jiacheng Liu

> Add UncaughtExceptionHandler to RaftServer and Daemon threads
> -------------------------------------------------------------
>
>                 Key: RATIS-1709
>                 URL: https://issues.apache.org/jira/browse/RATIS-1709
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Jiacheng Liu
>            Assignee: Jiacheng Liu
>            Priority: Major
>
> In Ratis, many threads are created directly with the `Daemon` class. If such a 
> thread hits an uncaught exception, it dies silently without any other 
> component knowing. If the thread is a critical component, part of the 
> RaftServer is effectively down while the RaftServer's lifecycle still reports 
> RUNNING (it is never set to EXCEPTION because the dying thread had no chance 
> to update it).
> One example is [https://github.com/apache/ratis/pull/417/files]. Before that 
> change, the StateMachineUpdater thread could throw an NPE and exit, leaving 
> the follower RaftServer stale forever. The RaftServer's lifecycle remains 
> RUNNING, so an external party cannot detect the failure through 
> `RaftServer.getLifeCycleState()`.
> The proposal is to improve observability of the RaftServer so that an uncaught 
> exception is caught and propagated to the external user, in two steps:
>  # Every `Daemon` thread should have an UncaughtExceptionHandler set.
>  # The UncaughtExceptionHandler is supplied by the application through 
> RaftServer.Builder when creating the RaftServer. The RaftServer then 
> propagates the handler to each Daemon thread as it creates them.
> So external users can do the following:
> {code:java}
> AtomicBoolean raftCrashed = new AtomicBoolean(false);
> AtomicReference<Throwable> raftError = new AtomicReference<>(null);
> RaftServer server = RaftServer.newBuilder()
>   .setUncaughtExceptionHandler((thread, ex) -> {
>     raftCrashed.set(true);
>     raftError.set(ex);
>   }).build();
> // Periodically check
> if (raftCrashed.get()) {
>   LOG.error("RaftServer crashed", raftError.get());
> }{code}
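
The propagation step in the proposal can be illustrated with plain JDK threads. This is only a minimal sketch, not the Ratis implementation: the class `DaemonHandlerSketch` and the helpers `startDaemon` and `runAndCapture` are hypothetical names, and the only API assumed is the standard `Thread.setUncaughtExceptionHandler`. It shows a daemon thread failing with an NPE and the application-supplied handler recording the error.

```java
import java.util.concurrent.atomic.AtomicReference;

public class DaemonHandlerSketch {
  /**
   * Hypothetical helper: starts a daemon thread with the given handler attached,
   * roughly what the RaftServer would do for each Daemon it creates.
   */
  static Thread startDaemon(String name, Runnable task,
                            Thread.UncaughtExceptionHandler handler) {
    Thread t = new Thread(task, name);
    t.setDaemon(true);
    t.setUncaughtExceptionHandler(handler);
    t.start();
    return t;
  }

  /** Runs a task that fails with an NPE and returns what the handler captured. */
  static Throwable runAndCapture() {
    AtomicReference<Throwable> error = new AtomicReference<>();
    Thread t = startDaemon("StateMachineUpdater-sketch",
        () -> { throw new NullPointerException("simulated failure"); },
        (thread, ex) -> error.set(ex));
    try {
      // The handler runs on the dying thread, so it has finished once join() returns.
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the daemon", e);
    }
    return error.get();
  }

  public static void main(String[] args) {
    System.out.println("captured: " + runAndCapture());
  }
}
```

Without the handler, the failure would surface only through the JVM's default handling on that thread, and the caller of `startDaemon` would have nothing to poll.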



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
