Re: 【JvmPauseMonitor】Timeout detection design reason

Tsz Wo Sze Fri, 20 Oct 2023 09:21:39 -0700

Hi Xinyu,

The JvmPauseMonitor monitors the local machine and tries to detect whether it
has become non-responsive.  As you know, it shuts down the server when the
extra sleep time exceeds a threshold.  The design is meant to detect a faulty
machine and stop it from running, since it may slow down the entire cluster.

I agree that the design is not suitable for a virtual machine environment.
(BTW, the other timeout mechanisms specified in the Raft algorithm may also
not be suitable for a virtual machine environment.)  As a workaround, it is
a good idea to set rpcSlownessTimeout to a large value to disable the
auto-shutdown.  Instead of reusing rpcSlownessTimeout, how about we add a
separate conf for the pause threshold?  Then, changing it won't affect the
slow follower detection feature.
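
To make the idea concrete, here is a rough sketch of the check with its own
threshold.  It is not the actual Ratis code: the class name, the method names
and the closeThresholdMs setting are all made up for illustration, and
treating a non-positive threshold as "disabled" is only an assumed convention.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Ratis JvmPauseMonitor implementation.
final class PauseCheckSketch {
  /** Time spent beyond the requested sleep interval, never negative. */
  static long extraSleepMs(long startNanos, long endNanos, long sleepMs) {
    final long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    return Math.max(0L, elapsedMs - sleepMs);
  }

  /**
   * Decide whether the server should be closed.
   * Assumed convention: closeThresholdMs <= 0 disables the auto-shutdown.
   */
  static boolean shouldClose(long extraMs, long closeThresholdMs) {
    return closeThresholdMs > 0 && extraMs > closeThresholdMs;
  }

  public static void main(String[] args) throws InterruptedException {
    final long sleepMs = 500;
    // Hypothetical dedicated setting, independent of rpcSlownessTimeout.
    final long closeThresholdMs = 60_000;

    for (int i = 0; i < 3; i++) {
      final long start = System.nanoTime();
      Thread.sleep(sleepMs);
      final long extra = extraSleepMs(start, System.nanoTime(), sleepMs);
      System.out.println("extra sleep = " + extra + " ms, close = "
          + shouldClose(extra, closeThresholdMs));
    }
  }
}
```

With a dedicated setting like this, a deployment that suspends its virtual
machines could turn off only the auto-shutdown and leave rpcSlownessTimeout
at its normal value.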

Tsz-Wo

On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:

> Hello, Ratis community
>
> I would like to understand the rationale behind a specific design detail of
> JvmPauseMonitor. In the current code base, when JvmPauseMonitor observes a
> JVM pause lasting over 60 seconds, it closes the RaftServerProxy in
> handleJvmPause.
>
> In our production system, some users may stop the virtual machine running
> the process for several minutes. When they resume the virtual machine, they
> find that the RaftServerProxy's state is already Closed, and they must
> restart it to restore the correct state. This has caused operational
> challenges for us. I would like to know the specific reasons for this
> design. What problem is it meant to prevent? If there's no particular
> reason, we will consider adjusting the rpcSlownessTimeout to infinity in
> IoTDB to disable this feature.
>
> Thanks
> ------------------------
> Xinyu Tan
>
