[ 
https://issues.apache.org/jira/browse/RATIS-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2282.
-------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

The pull request is now merged.  Thanks, [~gewu12207] !

> LogAppender Restart Due to Premature Log Entry Access During Concurrent Write 
> Processing
> ----------------------------------------------------------------------------------------
>
>                 Key: RATIS-2282
>                 URL: https://issues.apache.org/jira/browse/RATIS-2282
>             Project: Ratis
>          Issue Type: Bug
>          Components: Leader
>            Reporter: Xinhao GU
>            Assignee: Xinhao GU
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: a.patch, image-2025-04-18-17-36-36-096.png, 
> image-2025-04-19-16-37-16-322.png, image-2025-04-19-16-37-34-179.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> We observed two recurring patterns of {{GrpcLogAppender}} restarts caused by 
> log entry access failures:
> h3. Problem 1: Log entry not found
> {code:java}
> 2025-04-14 11:47:39,566 
> [4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] WARN  
> o.a.r.s.l.LogAppenderDaemon:89 - 
> 4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon failed 
> org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: 
> index = 376
>     at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:337)
>     at 
> org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
>     at 
> org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
>     at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
>     at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
>     at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>     at java.base/java.lang.Thread.run(Thread.java:829)
>     
> 2025-04-14 11:47:39,568 
> [4@group-000100000002->5-GrpcLogAppender-LogAppenderDaemon] INFO  
> o.a.r.s.i.LeaderStateImpl:688 - 4@group-000100000002-LeaderStateImpl: 
> Restarting GrpcLogAppender for 4@group-000100000002->5      {code}
> h3. Problem 2: Failed to load log entry
> {code:java}
> 2025-04-14 22:06:33,971 
> [3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] WARN  
> o.a.r.s.l.LogAppenderDaemon:89 - 
> 3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon failed 
> org.apache.ratis.server.raftlog.RaftLogIOException: Failed to load log entry 
> (t:2, i:118)
>     at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:306)
>     at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:520)
>     at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainLog(SegmentedRaftLog.java:325)
>     at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainEntryWithData(SegmentedRaftLog.java:335)
>     at 
> org.apache.ratis.server.leader.LogAppenderBase.readLogEntries(LogAppenderBase.java:376)
>     at 
> org.apache.ratis.server.leader.LogAppenderBase.nextAppendEntriesRequest(LogAppenderBase.java:332)
>     at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:403)
>     at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:269)
>     at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>     at java.lang.Thread.run(Thread.java:748)
>     
> 2025-04-14 22:06:34,001 
> [3@group-000100000002->2-GrpcLogAppender-LogAppenderDaemon] INFO  
> o.a.r.s.i.LeaderStateImpl:688 - 3@group-000100000002-LeaderStateImpl: 
> Restarting GrpcLogAppender for 3@group-000100000002->2          {code}
> h3. *Cause Analysis:*
> Both exceptions occur due to race conditions between:
>  # Leader writes log entry into memory cache and disk
>  # Leader sends log entry to followers via {{GrpcLogAppender}}
> The current implementation allows log replication to attempt accessing 
> entries before they're written. This is evidenced by:
>  * The successful prevention of errors through experimental sleep delays in 
> critical code paths
> !image-2025-04-18-17-36-36-096.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to