Xinhao GU created RATIS-2282:
--------------------------------
Summary: LogAppender Restart Due to Premature Log Entry Access
During Concurrent Write Processing
Key: RATIS-2282
URL: https://issues.apache.org/jira/browse/RATIS-2282
Project: Ratis
Issue Type: Bug
Reporter: Xinhao GU
Assignee: Xinhao GU
Attachments: image-2025-04-18-17-36-36-096.png
We observed two recurring patterns of {{GrpcLogAppender}} restarts caused by
log entry access failures:
h3. Problem 1: Log entry not found
{code:java}
2025-04-14 11:47:39,566
[4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] WARN
o.a.r.s.l.LogAppenderDaemon:89 -
4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon failed
org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index
= 376
at
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:337)
at
org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
at
org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
at
org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
at
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
at
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
at java.base/java.lang.Thread.run(Thread.java:829)
2025-04-14 11:47:39,568
[4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] INFO
o.a.r.s.i.LeaderStateImpl:688 - 4@group-000100000002-LeaderStateImpl:
Restarting GrpcLogAppender for 4@group-000100000002->5 {code}
h3. Problem 2: Failed to load log entry
{code:java}
2025-04-14 22:06:33,971
[3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] WARN
o.a.r.s.l.LogAppenderDaemon:89 -
3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon failed
org.apache.ratis.server.raftlog.RaftLogIOException: Failed to load log entry
(t:2, i:118)
at
org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:306)
at
org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:520)
at
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainLog(SegmentedRaftLog.java:325)
at
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:335)
at
org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
at
org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
at
org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
at
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
at
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
at java.lang.Thread.run(Thread.java:748)
2025-04-14 22:06:34,001
[3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] INFO
o.a.r.s.i.LeaderStateImpl:688 - 3@group-000100000002-LeaderStateImpl:
Restarting GrpcLogAppender for 3@group-000100000002->2 {code}
h3. *Cause Analysis:*
Both exceptions occur due to race conditions between:
# Leader writes log entry into memory cache and disk
# Leader sends log entry to followers via {{GrpcLogAppender}}
The current implementation allows log replication to attempt accessing entries
before they're written. This is evidenced by:
* The successful prevention of errors through experimental sleep delays in
critical code paths
!image-2025-04-18-17-36-36-096.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)