[
https://issues.apache.org/jira/browse/RATIS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599916#comment-17599916
]
Jiacheng Liu commented on RATIS-1695:
-------------------------------------
Taking this one step further, there is much more we can do to improve the
state exposed on the RaftServer. Each critical operation, if it fails, can
explicitly set the state on the RaftServer and transition it to EXCEPTION.
Currently that transition is rarely triggered; we are underusing the EXCEPTION
state.
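As a rough illustration of the idea, here is a minimal sketch of a wrapper that records a failure and flips the state before rethrowing. `CriticalOpSketch`, `runCritical`, `getState` and `getError` are hypothetical names made up for this example; none of them is existing Ratis API, and the real change would go through the RaftServer's own lifecycle handling.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder for server state; names are illustrative, not Ratis API. */
final class CriticalOpSketch {
  enum State { RUNNING, EXCEPTION }

  private volatile State state = State.RUNNING;
  // Keeps only the first failure so the root cause is not overwritten.
  private final AtomicReference<Throwable> error = new AtomicReference<>();

  State getState() { return state; }
  Throwable getError() { return error.get(); }

  /** Runs a critical operation; on failure, records the cause and moves to EXCEPTION before rethrowing. */
  <T> T runCritical(Callable<T> op) throws Exception {
    try {
      return op.call();
    } catch (Exception | Error e) {
      error.compareAndSet(null, e);  // store the cause first
      state = State.EXCEPTION;       // then publish the state change
      throw e;
    }
  }
}
```

Writing the error before the state means a caller that observes EXCEPTION will also see a non-null error.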
> Improve observability and internal error detection in RaftServer
> ----------------------------------------------------------------
>
> Key: RATIS-1695
> URL: https://issues.apache.org/jira/browse/RATIS-1695
> Project: Ratis
> Issue Type: Improvement
> Components: server
> Reporter: Jiacheng Liu
> Priority: Major
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> In Ratis, many threads are created manually using the `Daemon` class. If such
> a thread hits an uncaught exception, it dies silently and no other component
> is notified. If the thread is a critical component, then part of the
> RaftServer is essentially down while the RaftServer's lifecycle is still
> RUNNING (it is never set to EXCEPTION because the thread had no chance to do
> so).
> One example where this can happen is
> [https://github.com/apache/ratis/pull/417/files]. Before that change, the
> StateMachineUpdater thread could throw an NPE and exit, leaving the follower
> RaftServer stale forever. The RaftServer's lifecycle stays RUNNING, and an
> external party has no way to detect the problem through
> `RaftServer.getLifeCycleState()`.
> The proposal is to improve observability on RaftServer so that an uncaught
> exception is caught and propagated to the external user, in several steps:
> # All `Daemon` threads should have an UncaughtExceptionHandler set.
> # Add an extra field to the RaftServer to store an exception, and that field
> can be set by the UncaughtExceptionHandler instances.
> # The UncaughtExceptionHandler also transitions the RaftServer to EXCEPTION
> state.
> So external users can:
> {code:java}
> RaftServer server = RaftServer.newBuilder().build();
> // Periodically check
> if (server.getLifeCycleState() == State.EXCEPTION) {
>   // getError() is the accessor proposed above for the new exception field
>   Throwable t = server.getError();
>   // Deal with the throwable
> }
> {code}
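The three steps of the proposal could be sketched roughly as follows. This is a self-contained illustration using plain `java.lang.Thread` rather than the Ratis `Daemon` class; `ServerStateSketch`, `fail`, `startDaemon` and `demo` are hypothetical names, and `getError()` mirrors the accessor proposed in the description rather than anything that exists today.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for RaftServer; not the Ratis API. */
final class ServerStateSketch {
  enum State { RUNNING, EXCEPTION }

  private volatile State state = State.RUNNING;
  // Extra field holding the first uncaught exception (step 2 of the proposal).
  private final AtomicReference<Throwable> error = new AtomicReference<>();

  State getLifeCycleState() { return state; }
  Throwable getError() { return error.get(); }

  /** Records the failure and transitions to EXCEPTION (step 3). */
  void fail(Throwable t) {
    error.compareAndSet(null, t);
    state = State.EXCEPTION;
  }

  /** Starts a daemon thread with an UncaughtExceptionHandler set (step 1). */
  Thread startDaemon(String name, Runnable task) {
    Thread thread = new Thread(task, name);
    thread.setDaemon(true);
    thread.setUncaughtExceptionHandler((th, ex) -> fail(ex));
    thread.start();
    return thread;
  }

  /** Runs one failing worker and returns the server so callers can inspect it. */
  static ServerStateSketch demo() {
    ServerStateSketch server = new ServerStateSketch();
    Thread worker = server.startDaemon("updater-sketch", () -> {
      throw new IllegalStateException("simulated failure");
    });
    try {
      worker.join();  // the handler runs on the dying thread before join() returns
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return server;
  }
}
```

After `ServerStateSketch.demo()`, `getLifeCycleState()` returns `EXCEPTION` and `getError()` returns the simulated `IllegalStateException`, which is what the periodic check above would observe.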
--
This message was sent by Atlassian Jira
(v8.20.10#820010)