Xinhao GU created RATIS-2282:
--------------------------------

             Summary: LogAppender Restart Due to Premature Log Entry Access 
During Concurrent Write Processing
                 Key: RATIS-2282
                 URL: https://issues.apache.org/jira/browse/RATIS-2282
             Project: Ratis
          Issue Type: Bug
            Reporter: Xinhao GU
            Assignee: Xinhao GU
         Attachments: image-2025-04-18-17-36-36-096.png

We observed two recurring patterns of {{GrpcLogAppender}} restarts caused by 
log entry access failures:
h3. Problem 1: Log entry not found
{code:java}
2025-04-14 11:47:39,566 
[4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] WARN  
o.a.r.s.l.LogAppenderDaemon:89 - 
4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon failed 
org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index 
= 376
    at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:337)
    at 
org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
    at 
org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
    at 
org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
    at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
    at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
    at java.base/java.lang.Thread.run(Thread.java:829)
    
2025-04-14 11:47:39,568 
[4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] INFO  
o.a.r.s.i.LeaderStateImpl:688 - 4@group-000100000002-LeaderStateImpl: 
Restarting GrpcLogAppender for 4@group-000100000002->5      {code}
h3. Problem 2: Failed to load log entry
{code:java}
2025-04-14 22:06:33,971 
[3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] WARN  
o.a.r.s.l.LogAppenderDaemon:89 - 
3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon failed 
org.apache.ratis.server.raftlog.RaftLogIOException: Failed to load log entry 
(t:2, i:118)
    at 
org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:306)
    at 
org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:520)
    at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainLog(SegmentedRaftLog.java:325)
    at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:335)
    at 
org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
    at 
org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
    at 
org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
    at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
    at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
    at java.lang.Thread.run(Thread.java:748)
    
2025-04-14 22:06:34,001 
[3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] INFO  
o.a.r.s.i.LeaderStateImpl:688 - 3@group-000100000002-LeaderStateImpl: 
Restarting GrpcLogAppender for 3@group-000100000002->2          {code}
h3. *Cause Analysis:*

Both exceptions occur due to race conditions between:
 # Leader writes log entry into memory cache and disk
 # Leader sends log entry to followers via {{GrpcLogAppender}}

The current implementation allows log replication to attempt accessing entries 
before they're written. This is evidenced by:
 * The successful prevention of errors through experimental sleep delays in 
critical code paths

!image-2025-04-18-17-36-36-096.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to