Hi Xinyu,

The JvmPauseMonitor monitors the local machine and tries to detect whether it has become non-responsive. As you know, it shuts down the server when the extra sleep exceeds a threshold. The design is meant to detect and stop a faulty machine that is still running, since such a machine may slow down the entire cluster.
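
To make the "extra sleep" check concrete, here is a minimal sketch of the idea. It is not the actual Ratis code: the class name, method names, and the 500 ms sleep interval are hypothetical and chosen only for illustration. The 60-second threshold is the value mentioned in the original question.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names and the sleep interval are assumptions,
// not the Ratis JvmPauseMonitor implementation.
public class PauseMonitorSketch {
    /** Time spent beyond the requested sleep, in milliseconds (never negative). */
    static long extraSleepMillis(long startNanos, long endNanos, long requestedSleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - requestedSleepMillis);
    }

    /** True when the extra sleep is larger than the threshold. */
    static boolean shouldShutDown(long extraSleepMillis, long thresholdMillis) {
        return extraSleepMillis > thresholdMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        final long sleepMillis = 500;        // assumed polling interval
        final long thresholdMillis = 60_000; // 60 seconds, as in the question

        for (int i = 0; i < 3; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long extra = extraSleepMillis(start, System.nanoTime(), sleepMillis);
            if (shouldShutDown(extra, thresholdMillis)) {
                // A real monitor would close the server here.
                System.out.println("Pause of " + extra + " ms exceeds the threshold");
                return;
            }
        }
        System.out.println("No long pause detected");
    }
}
```

The thread asks to sleep for a fixed interval and measures how long it actually slept. If the process (or the whole machine) was stalled, the measured time is much longer than requested, and the difference is compared with the threshold.
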
I agree that the design is not suitable for a virtual machine environment. (BTW, the other timeout mechanisms specified in the Raft algorithm may also not be suitable for a virtual machine environment.) As a workaround, it is a good idea to set rpcSlownessTimeout to a large value to disable the auto-shutdown.

Instead of using rpcSlownessTimeout, how about we use a separate conf for the threshold? Then it won't affect the slow follower detection feature.

Tsz-Wo

On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:

> Hello, Ratis community
>
> I would like to understand the rationale behind a specific design detail of
> JvmPauseMonitor. In the current code base, when JvmPauseMonitor observes a
> JVM pause lasting over 60 seconds, it closes the RaftServerProxy in the
> handleJvmPause.
>
> In our production system, some users may stop the virtual machine running
> the process for several minutes. When they resume the virtual machine, they
> find that the RaftServerProxy's state is already Closed, and they must
> restart it to restore the correct state. This has caused operational
> challenges for us. I would like to know the specific reasons for this
> design. What problem is it meant to prevent? If there's no particular
> reason, we will consider adjusting the rpcSlownessTimeout to infinity in
> IoTDB to disable this feature.
>
> Thanks
> ------------------------
> Xinyu Tan
>
