Re: 【JvmPauseMonitor】Timeout detection design reason

Tsz Wo Sze Tue, 24 Oct 2023 10:15:06 -0700

Hi Xinyu,

Thanks for the quick work on RATIS-1918!  The pull request is now merged.


Tsz-Wo



On Tue, Oct 24, 2023 at 2:30 AM Xinyu Tan <[email protected]> wrote:

> Hi, Tsz-Wo
>
> > BTW, the other timeout mechanisms specified in the Raft algorithm may
> > also not be suitable for a virtual machine environment.
>
> I suddenly realized that for the "lease read," it uses nanotime to
> determine the duration of the lease. During a virtual machine pause, this
> value in the JVM is likely not to increase. So, it's possible that after
> the old leader's virtual machine is restored, it may still serve read
> requests, leading to the occurrence of a split-brain phenomenon. In this
> regard, perhaps setting it to an infinite value is not a good idea~
>
> However, I strongly support the idea of introducing a separate parameter to
> distinguish it from the judgment of the "slowFollower." Maybe I can create
> an issue and submit a pull request?
>
> Thanks
> ------------------------
> Xinyu Tan
>
> Tsz Wo Sze <[email protected]> wrote on Sat, Oct 21, 2023 at 00:22:
>
> > Hi Xinyu,
> >
> > The JvmPauseMonitor is to monitor the local machine and try to detect if
> > it is non-responsive.  As you know, it will shut down the server when the
> > extra sleep is larger than a threshold.  The design is to detect and
> > prevent a running faulty machine since it may slow down the entire
> > cluster.
> >
> > I agree that the design is not suitable for a virtual machine
> > environment.  (BTW, the other timeout mechanisms specified in the Raft
> > algorithm may also not be suitable for a virtual machine environment.)
> > As a workaround, it is a good idea to set rpcSlownessTimeout to a large
> > value for disabling the auto-shutdown.  Instead of using
> > rpcSlownessTimeout, how about we use a separate conf for the threshold?
> > Then, it won't affect the slow follower detection feature.
> >
> > Tsz-Wo
> >
> >
> > On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:
> >
> > > Hello, Ratis community
> > >
> > > I would like to understand the rationale behind a specific design
> > > detail of JvmPauseMonitor. In the current code base, when
> > > JvmPauseMonitor observes a JVM pause lasting over 60 seconds, it closes
> > > the RaftServerProxy in the handleJvmPause.
> > >
> > > In our production system, some users may stop the virtual machine
> > > running the process for several minutes. When they resume the virtual
> > > machine, they find that the RaftServerProxy's state is already Closed,
> > > and they must restart it to restore the correct state. This has caused
> > > operational challenges for us. I would like to know the specific
> > > reasons for this design. What problem is it meant to prevent? If
> > > there's no particular reason, we will consider adjusting the
> > > rpcSlownessTimeout to infinity in IoTDB to disable this feature.
> > >
> > > Thanks
> > > ------------------------
> > > Xinyu Tan
> > >
> >
>
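
For readers of the archive, the "extra sleep larger than a threshold" check discussed in the thread can be sketched roughly as follows. This is a minimal illustration only, not the Ratis JvmPauseMonitor code: the class name PauseDetectorSketch, its methods, and the 500 ms sleep interval are hypothetical, and the close threshold is passed in as its own parameter to mirror the "separate conf" idea rather than reusing rpcSlownessTimeout. As noted in the thread, the monotonic clock may not advance while a virtual machine is suspended, so a check like this might not see the whole pause in that environment.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a pause check; not the Ratis implementation. */
public class PauseDetectorSketch {
  private final long sleepNanos;           // how long each round asks to sleep
  private final long closeThresholdNanos;  // assumed separate threshold, not rpcSlownessTimeout

  public PauseDetectorSketch(long sleepMillis, long closeThresholdMillis) {
    this.sleepNanos = TimeUnit.MILLISECONDS.toNanos(sleepMillis);
    this.closeThresholdNanos = TimeUnit.MILLISECONDS.toNanos(closeThresholdMillis);
  }

  /** Time spent beyond the requested sleep; clamped so it is never negative. */
  public long extraSleepNanos(long startNanos, long endNanos) {
    return Math.max(0L, (endNanos - startNanos) - sleepNanos);
  }

  /** True when the extra sleep is strictly larger than the close threshold. */
  public boolean exceedsThreshold(long extraNanos) {
    return extraNanos > closeThresholdNanos;
  }

  /** One monitoring round: sleep, then measure how late the wake-up was. */
  public long measureOnce() throws InterruptedException {
    final long start = System.nanoTime();
    TimeUnit.NANOSECONDS.sleep(sleepNanos);
    return extraSleepNanos(start, System.nanoTime());
  }

  public static void main(String[] args) throws InterruptedException {
    // 500 ms sleep interval and a 60 s threshold (the 60 s figure is from the thread).
    final PauseDetectorSketch detector = new PauseDetectorSketch(500, 60_000);
    final long extra = detector.measureOnce();
    System.out.println("extra sleep (ms): " + TimeUnit.NANOSECONDS.toMillis(extra)
        + ", would close: " + detector.exceedsThreshold(extra));
  }
}
```

Keeping the threshold as a separate constructor argument means it can be raised (or made effectively unlimited) without touching whatever value drives slow-follower detection, which is the separation proposed above.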
