Hi, Tsz-Wo

Thanks for the quick review :)

Best
----------------------
Xinyu Tan

On 2023/10/24 09:28:36 Xinyu Tan wrote:
> Hi, Tsz-Wo
>
> > BTW, the other timeout mechanisms specified in the Raft algorithm may
> > also not be suitable for a virtual machine environment.
>
> I suddenly realized that for the "lease read," it uses nanotime to
> determine the duration of the lease. During a virtual machine pause, this
> value in the JVM is likely not to increase. So, it's possible that after
> the old leader's virtual machine is restored, it may still serve read
> requests, leading to the occurrence of a split-brain phenomenon. In this
> regard, perhaps setting it to an infinite value is not a good idea~
>
> However, I strongly support the idea of introducing a separate parameter to
> distinguish it from the judgment of the "slowFollower." Maybe I can create
> an issue and submit a pull request?
>
> Thanks
> ------------------------
> Xinyu Tan
>
> On Sat, Oct 21, 2023 at 00:22, Tsz Wo Sze <[email protected]> wrote:
>
> > Hi Xinyu,
> >
> > The JvmPauseMonitor is to monitor the local machine and try to detect if it
> > is non-responsive. As you know, it will shut down the server when the
> > extra sleep is larger than a threshold. The design is to detect and
> > prevent a running faulty machine since it may slow down the entire cluster.
> >
> > I agree that the design is not suitable for a virtual machine environment.
> > (BTW, the other timeout mechanisms specified in the Raft algorithm may
> > also not be suitable for a virtual machine environment.) As a workaround,
> > it is a good idea to set rpcSlownessTimeout to a large value for disabling
> > the auto-shutdown. Instead of using rpcSlownessTimeout, how about we use a
> > separate conf for the threshold? Then, it won't affect the slow follower
> > detection feature.
> >
> > Tsz-Wo
> >
> >
> > On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:
> >
> > > Hello, Ratis community
> > >
> > > I would like to understand the rationale behind a specific design detail of
> > > JvmPauseMonitor. In the current code base, when JvmPauseMonitor observes a
> > > JVM pause lasting over 60 seconds, it closes the RaftServerProxy in the
> > > handleJvmPause.
> > >
> > > In our production system, some users may stop the virtual machine running
> > > the process for several minutes. When they resume the virtual machine, they
> > > find that the RaftServerProxy's state is already Closed, and they must
> > > restart it to restore the correct state. This has caused operational
> > > challenges for us. I would like to know the specific reasons for this
> > > design. What problem is it meant to prevent? If there's no particular
> > > reason, we will consider adjusting the rpcSlownessTimeout to infinity in
> > > IoTDB to disable this feature.
> > >
> > > Thanks
> > > ------------------------
> > > Xinyu Tan
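
For readers following the thread, the pause-detection idea described above (sleep for a fixed interval, measure how much longer the sleep actually took, and shut down when that extra sleep exceeds a threshold) can be sketched roughly as follows. This is only an illustration and not the Ratis JvmPauseMonitor code: the class PauseDetectorSketch, its method names, and the dedicated threshold parameter (standing in for the "separate conf" proposed in the thread) are all hypothetical. The clock is injected so that the behaviour of the time source during a VM pause, which the thread identifies as the open question, stays visible as an assumption.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of extra-sleep pause detection; NOT the Ratis implementation. */
public class PauseDetectorSketch {
  private final LongSupplier nanoClock;        // assumed time source, e.g. System::nanoTime
  private final long sleepNanos;               // how long each probe sleeps
  private final long shutdownThresholdNanos;   // hypothetical dedicated threshold

  public PauseDetectorSketch(LongSupplier nanoClock, long sleepMs, long thresholdMs) {
    this.nanoClock = nanoClock;
    this.sleepNanos = TimeUnit.MILLISECONDS.toNanos(sleepMs);
    this.shutdownThresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMs);
  }

  /** Time spent beyond the requested sleep; clamped so it is never negative. */
  public static long extraSleepNanos(long beforeNanos, long afterNanos, long requestedNanos) {
    return Math.max(0L, (afterNanos - beforeNanos) - requestedNanos);
  }

  /** True when the extra sleep is strictly larger than the configured threshold. */
  public boolean exceedsThreshold(long extraNanos) {
    return extraNanos > shutdownThresholdNanos;
  }

  /** One probe: sleep, measure, and report whether the caller should shut down. */
  public boolean checkOnce() throws InterruptedException {
    final long before = nanoClock.getAsLong();
    TimeUnit.NANOSECONDS.sleep(sleepNanos);
    final long after = nanoClock.getAsLong();
    return exceedsThreshold(extraSleepNanos(before, after, sleepNanos));
  }

  public static void main(String[] args) throws InterruptedException {
    // 100 ms probes with a 60 s threshold, mirroring the 60 s figure in the thread.
    PauseDetectorSketch detector = new PauseDetectorSketch(System::nanoTime, 100, 60_000);
    System.out.println("should shut down: " + detector.checkOnce());
  }
}
```

Keeping the threshold as its own constructor argument mirrors the proposal in the thread: it can be raised to disable auto-shutdown without touching slow-follower detection. Whether the injected clock advances while a virtual machine is suspended depends on the platform, so this sketch makes no claim about that case.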
