> One more thing is that it may affect client retries ...

We should detect and ignore client retries for requests sent a long time ago; filed RATIS-1922.

Tsz-Wo

On Fri, Oct 27, 2023 at 8:58 AM Tsz Wo Sze <[email protected]> wrote:

> One more thing is that it may affect client retries -- When a client is
> sending retry requests, the virtual machine of the client could be
> stopped in the middle. However, the servers keep running and the retry
> cache entries may time out. Then, when the client VM wakes up the next
> day, the client will still send the retries for the requests sent yesterday
> and the servers will treat these retries as new requests.
>
> Tsz-Wo
>
>
> On Thu, Oct 26, 2023 at 11:47 PM Xinyu Tan <[email protected]> wrote:
>
>> > Under the situation where the JVM pauses are frequent and long-lasting,
>> > it’s better to use Core-Raft read (through Raft Log) if you still want
>> > 100% consistency.
>>
>> In fact, relying on heartbeat to confirm one's leadership status is
>> fundamentally based on logical clocks rather than physical clocks. So, for
>> Ratis 3.0, if LINEARIZABLE reads are enabled and leases are disabled,
>> theoretically, there should be no security issues.
>>
>> I believe that if users want 100% consistency, instead of implementing
>> LINEARIZABLE reads slowly through the Raft log, we might recommend that
>> users simply disable leases. This way, performance may be better. What do
>> you all think?
>>
>> Best
>> ---------------------
>> Xinyu Tan
>>
>> On 2023/10/26 01:29:11 William Song wrote:
>> > Core-Raft guarantees safety by using a logical clock (term-index)
>> > instead of a physical clock. However, optimizations like
>> > leader-bypass-read or leader-lease-read rely on a physical clock (leader
>> > election timeout). We already have a leaderStepDownWaitTime in
>> > JVMPauseMonitor to prevent this situation. Still, leaderStepDownWaitTime
>> > cannot guarantee 100% linearizability.
>> >
>> > Under the situation where the JVM pauses are frequent and long-lasting,
>> > it’s better to use Core-Raft read (through Raft Log) if you still want
>> > 100% consistency.
>> >
>> > Best,
>> > William
>> >
>> > > On Oct 24, 2023, at 17:28, Xinyu Tan <[email protected]> wrote:
>> > >
>> > > Hi, Tsz-Wo
>> > >
>> > >> BTW, the other timeout mechanisms specified in the Raft algorithm may
>> > >> also not be suitable for a virtual machine environment.
>> > >
>> > > I suddenly realized that for the "lease read," it uses nanotime to
>> > > determine the duration of the lease. During a virtual machine pause,
>> > > this value in the JVM is likely not to increase. So, it's possible that
>> > > after the old leader's virtual machine is restored, it may still serve
>> > > read requests, leading to the occurrence of a split-brain phenomenon.
>> > > In this regard, perhaps setting it to an infinite value is not a good
>> > > idea~
>> > >
>> > > However, I strongly support the idea of introducing a separate
>> > > parameter to distinguish it from the judgment of the "slowFollower."
>> > > Maybe I can create an issue and submit a pull request?
>> > >
>> > > Thanks
>> > > ------------------------
>> > > Xinyu Tan
>> > >
>> > > On Sat, Oct 21, 2023 at 00:22, Tsz Wo Sze <[email protected]> wrote:
>> > >
>> > >> Hi Xinyu,
>> > >>
>> > >> The JvmPauseMonitor monitors the local machine and tries to detect
>> > >> whether it is non-responsive. As you know, it will shut down the
>> > >> server when the extra sleep is larger than a threshold. The design is
>> > >> to detect a faulty machine and prevent it from running, since it may
>> > >> slow down the entire cluster.
>> > >>
>> > >> I agree that the design is not suitable for a virtual machine
>> > >> environment. (BTW, the other timeout mechanisms specified in the Raft
>> > >> algorithm may also not be suitable for a virtual machine environment.)
>> > >> As a workaround, it is a good idea to set rpcSlownessTimeout to a
>> > >> large value for disabling the auto-shutdown. Instead of using
>> > >> rpcSlownessTimeout, how about we use a separate conf for the
>> > >> threshold? Then, it won't affect the slow follower detection feature.
>> > >>
>> > >> Tsz-Wo
>> > >>
>> > >>
>> > >> On Thu, Oct 19, 2023 at 7:48 PM Xinyu Tan <[email protected]> wrote:
>> > >>
>> > >>> Hello, Ratis community
>> > >>>
>> > >>> I would like to understand the rationale behind a specific design
>> > >>> detail of JvmPauseMonitor. In the current code base, when
>> > >>> JvmPauseMonitor observes a JVM pause lasting over 60 seconds, it
>> > >>> closes the RaftServerProxy in the handleJvmPause.
>> > >>>
>> > >>> In our production system, some users may stop the virtual machine
>> > >>> running the process for several minutes. When they resume the virtual
>> > >>> machine, they find that the RaftServerProxy's state is already
>> > >>> Closed, and they must restart it to restore the correct state. This
>> > >>> has caused operational challenges for us. I would like to know the
>> > >>> specific reasons for this design. What problem is it meant to
>> > >>> prevent? If there's no particular reason, we will consider adjusting
>> > >>> the rpcSlownessTimeout to infinity in IoTDB to disable this feature.
>> > >>>
>> > >>> Thanks
>> > >>> ------------------------
>> > >>> Xinyu Tan
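
For readers following the thread, here is a minimal Python sketch of the "extra sleep" idea described above: sleep for a fixed interval, measure how far the wake-up overshoots, and shut down only when the overshoot crosses a dedicated threshold. This is not the Ratis implementation. The names PauseMonitor, warn_threshold_s, shutdown_threshold_s, on_pause and on_shutdown are all hypothetical; shutdown_threshold_s stands in for the "separate conf" proposed in the thread, and setting it to infinity disables the auto-shutdown without touching any slow-follower timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class PauseMonitor:
    """Hypothetical sketch: detect local pauses via sleep overshoot."""
    sleep_s: float                        # nominal sleep per iteration
    shutdown_threshold_s: float           # separate threshold (assumption);
                                          # float("inf") disables shutdown
    on_pause: Callable[[float], None]     # called with the extra sleep time
    on_shutdown: Callable[[float], None]  # called when the pause is too long
    # Injectable clock and sleep so the logic can be tested without waiting.
    # Caveat from the thread: a monotonic clock may not advance while a VM
    # is paused, in which case the pause is invisible to this check.
    clock: Callable[[], float] = time.monotonic
    sleep: Callable[[float], None] = time.sleep
    warn_threshold_s: float = 0.1         # report-only threshold (assumption)

    def run_once(self) -> float:
        """Sleep once and return how long the wake-up overshot, in seconds."""
        start = self.clock()
        self.sleep(self.sleep_s)
        extra = max(0.0, self.clock() - start - self.sleep_s)
        if extra >= self.shutdown_threshold_s:
            self.on_shutdown(extra)
        elif extra >= self.warn_threshold_s:
            self.on_pause(extra)
        return extra
```

In a real server, run_once would be called in a loop on a background thread. Keeping the shutdown threshold independent of the warning threshold is the design point from the thread: operators who pause VMs on purpose can raise or disable it while other timeout-based features keep their own settings.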
