Re: 【JvmPauseMonitor】Timeout detection design reason

Xinyu Tan Tue, 24 Oct 2023 18:44:45 -0700

Hi, Tsz-Wo

Thanks for the quick review :)


Best
----------------------
Xinyu Tan

On 2023/10/24 09:28:36 Xinyu Tan wrote:
> Hi, Tsz-Wo
> 
> > BTW, the other timeout mechanisms specified in the Raft algorithm may
> > also not be suitable for a virtual machine environment.
> 
> I suddenly realized that for the "lease read," it uses nanotime to
> determine the duration of the lease. During a virtual machine pause, this
> value in the JVM is likely not to increase. So, it's possible that after
> the old leader's virtual machine is restored, it may still serve read
> requests, leading to the occurrence of a split-brain phenomenon. In this
> regard, perhaps setting it to an infinite value is not a good idea~
> 
> However, I strongly support the idea of introducing a separate parameter to
> distinguish it from the judgment of the "slowFollower." Maybe I can create
> an issue and submit a pull request?
> 
> Thanks
> ------------------------
> Xinyu Tan
> 
> Tsz Wo Sze <[email protected]> wrote on Sat, Oct 21, 2023 at 00:22:
> 
> > Hi Xinyu,
> >
> > The JvmPauseMonitor is to monitor the local machine and try to detect if it
> > is non-responsive.  As you know, it will shut down the server when the
> > extra sleep is larger than a threshold.  The design is to detect and
> > prevent a running faulty machine since it may slow down the entire cluster.
> >
> > I agree that the design is not suitable for a virtual machine environment.
> >  (BTW, the other timeout mechanisms specified in the Raft algorithm may
> > also not be suitable for a virtual machine environment.)  As a workaround,
> > it is a good idea to set rpcSlownessTimeout to a large value for disabling
> > the auto-shutdown.  Instead of using rpcSlownessTimeout, how about we use a
> > separate conf for the threshold?  Then, it won't affect the slow follower
> > detection feature.
> >
> > Tsz-Wo
> >
> >
> > On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:
> >
> > > Hello, Ratis community
> > >
> > > I would like to understand the rationale behind a specific design detail of
> > > JvmPauseMonitor. In the current code base, when JvmPauseMonitor observes a
> > > JVM pause lasting over 60 seconds, it closes the RaftServerProxy in the
> > > handleJvmPause.
> > >
> > > In our production system, some users may stop the virtual machine running
> > > the process for several minutes. When they resume the virtual machine, they
> > > find that the RaftServerProxy's state is already Closed, and they must
> > > restart it to restore the correct state. This has caused operational
> > > challenges for us. I would like to know the specific reasons for this
> > > design. What problem is it meant to prevent? If there's no particular
> > > reason, we will consider adjusting the rpcSlownessTimeout to infinity in
> > > IoTDB to disable this feature.
> > >
> > > Thanks
> > > ------------------------
> > > Xinyu Tan
> > >
> >
>
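
For readers following the thread, the "extra sleep" check and the proposed separate threshold can be sketched roughly as below. This is only an illustration, not the actual Ratis JvmPauseMonitor code: the class name PauseCheckSketch, the method names, and the "non-positive threshold disables shutdown" convention are all assumptions made for this example. Note also the point raised above: if the nanoTime-based clock does not advance while a virtual machine is paused, a check like this would not observe that pause at all.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an extra-sleep check; not the real JvmPauseMonitor. */
final class PauseCheckSketch {

  /** Returns how much longer than requested the sleep took, in ms (never negative). */
  static long extraSleepMillis(long startNanos, long endNanos, long requestedSleepMillis) {
    long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    return Math.max(0L, elapsedMillis - requestedSleepMillis);
  }

  /**
   * Decides whether to close the server, using a dedicated pause threshold
   * instead of rpcSlownessTimeout. Assumed convention: a non-positive
   * threshold disables the auto-shutdown.
   */
  static boolean shouldCloseServer(long extraSleepMillis, long pauseThresholdMillis) {
    return pauseThresholdMillis > 0 && extraSleepMillis > pauseThresholdMillis;
  }

  public static void main(String[] args) throws InterruptedException {
    final long sleepMillis = 500;             // how long the monitor thread sleeps per round
    final long pauseThresholdMillis = 60_000; // the 60-second limit mentioned in the thread

    long start = System.nanoTime();
    Thread.sleep(sleepMillis);
    long extra = extraSleepMillis(start, System.nanoTime(), sleepMillis);

    System.out.println("extra sleep (ms): " + extra
        + ", close server: " + shouldCloseServer(extra, pauseThresholdMillis));
  }
}
```

Keeping the pause threshold as its own parameter means it can be raised or disabled for virtual machine deployments without touching the value used for slow follower detection.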
