Re: 【JvmPauseMonitor】Timeout detection design reason

Xinyu Tan Tue, 24 Oct 2023 02:28:55 -0700

Hi, Tsz-Wo

> BTW, the other timeout mechanisms specified in the Raft algorithm may
> also not be suitable for a virtual machine environment.


I just realized that "lease read" uses nanoTime to measure the duration of
the lease. During a virtual machine pause, this value in the JVM is likely
not to advance. So after the old leader's virtual machine is resumed, it
may still consider its lease valid and serve read requests, which leads to
a split-brain situation. Given that, setting rpcSlownessTimeout to an
infinite value is perhaps not a good idea.
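
To make the concern concrete, here is a minimal Java sketch of a lease
check driven by a monotonic nanosecond clock. It is not the Ratis code:
the class LeaseSketch, its fields, and the injected clock are hypothetical
names used only for illustration, and whether System.nanoTime() really
stalls during a VM pause depends on the hypervisor and the OS.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a leader lease; not the Ratis implementation. */
final class LeaseSketch {
  private final LongSupplier nanoClock; // in production this would be System::nanoTime
  private final long leaseNanos;        // lease duration in nanoseconds
  private long leaseStartNanos;         // clock reading when the lease was last granted

  LeaseSketch(LongSupplier nanoClock, long leaseNanos) {
    this.nanoClock = nanoClock;
    this.leaseNanos = leaseNanos;
    this.leaseStartNanos = nanoClock.getAsLong();
  }

  /** Assumed to be called when a majority acknowledges the leader. */
  void renew() {
    leaseStartNanos = nanoClock.getAsLong();
  }

  /** The leader answers reads locally only while this returns true. */
  boolean isValid() {
    return nanoClock.getAsLong() - leaseStartNanos < leaseNanos;
  }

  public static void main(String[] args) {
    final long[] fakeNanos = {0L}; // stand-in for the JVM's nanoTime
    LeaseSketch lease = new LeaseSketch(() -> fakeNanos[0], 1_000_000_000L);

    // Simulated VM pause: minutes pass outside, but the clock did not move.
    System.out.println("valid after frozen pause: " + lease.isValid());

    // If the clock had advanced past the lease duration, the lease expires.
    fakeNanos[0] += 2_000_000_000L;
    System.out.println("valid after clock advanced: " + lease.isValid());
  }
}
```

If the clock does not move while the VM is suspended, the first check
still passes after resume, even though the rest of the cluster may have
elected a new leader in the meantime.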

However, I strongly support the idea of introducing a separate parameter
for the pause threshold, so that it is independent of the slow-follower
detection. Maybe I can create an issue and submit a pull request?
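
As a rough sketch of what a separate parameter could look like (all names
here are hypothetical, including jvmPauseCloseThresholdMs, and the
slow-follower check is deliberately simplified; none of this is existing
Ratis API):

```java
/** Hypothetical sketch: two independent thresholds instead of one shared value. */
final class PauseConfSketch {
  // Existing setting, kept only for slow-follower detection.
  private final long rpcSlownessTimeoutMs;
  // Hypothetical new setting that controls the auto-shutdown after a JVM pause.
  // Long.MAX_VALUE effectively disables the shutdown.
  private final long jvmPauseCloseThresholdMs;

  PauseConfSketch(long rpcSlownessTimeoutMs, long jvmPauseCloseThresholdMs) {
    this.rpcSlownessTimeoutMs = rpcSlownessTimeoutMs;
    this.jvmPauseCloseThresholdMs = jvmPauseCloseThresholdMs;
  }

  /** Simplified: a follower is slow if it has not responded within the timeout. */
  boolean isSlowFollower(long msSinceLastRpcResponse) {
    return msSinceLastRpcResponse > rpcSlownessTimeoutMs;
  }

  /** Close the server only if the extra sleep exceeds the dedicated threshold. */
  boolean shouldCloseOnPause(long extraSleepMs) {
    return extraSleepMs > jvmPauseCloseThresholdMs;
  }

  public static void main(String[] args) {
    // Shutdown disabled, slow-follower detection still at 60 seconds.
    PauseConfSketch conf = new PauseConfSketch(60_000L, Long.MAX_VALUE);
    System.out.println("close after 5 min pause: " + conf.shouldCloseOnPause(300_000L));
    System.out.println("slow follower at 61 s: " + conf.isSlowFollower(61_000L));
  }
}
```

With this split, raising or disabling the pause threshold would leave the
slow-follower detection unchanged.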

Thanks
------------------------
Xinyu Tan

On Sat, Oct 21, 2023 at 00:22, Tsz Wo Sze <[email protected]> wrote:

> Hi Xinyu,
>
> The JvmPauseMonitor is to monitor the local machine and try to detect if it
> is non-responsive.  As you know, it will shut down the server when the
> extra sleep is larger than a threshold.  The design is to detect and
> prevent a running faulty machine since it may slow down the entire cluster.
>
> I agree that the design is not suitable for a virtual machine environment.
>  (BTW, the other timeout mechanisms specified in the Raft algorithm may
> also not be suitable for a virtual machine environment.)  As a workaround,
> it is a good idea to set rpcSlownessTimeout to a large value for disabling
> the auto-shutdown.  Instead of using rpcSlownessTimeout, how about we use a
> separate conf for the threshold?  Then, it won't affect the slow follower
> detection feature.
>
> Tsz-Wo
>
>
> On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:
>
> > Hello, Ratis community
> >
> > I would like to understand the rationale behind a specific design detail
> > of JvmPauseMonitor. In the current code base, when JvmPauseMonitor
> > observes a JVM pause lasting over 60 seconds, it closes the
> > RaftServerProxy in the handleJvmPause.
> >
> > In our production system, some users may stop the virtual machine running
> > the process for several minutes. When they resume the virtual machine,
> > they find that the RaftServerProxy's state is already Closed, and they
> > must restart it to restore the correct state. This has caused operational
> > challenges for us. I would like to know the specific reasons for this
> > design. What problem is it meant to prevent? If there's no particular
> > reason, we will consider adjusting the rpcSlownessTimeout to infinity in
> > IoTDB to disable this feature.
> >
> > Thanks
> > ------------------------
> > Xinyu Tan
> >
>
