[ 
https://issues.apache.org/jira/browse/RATIS-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943878#comment-17943878
 ] 

Xinhao GU edited comment on RATIS-2278 at 4/13/25 4:33 AM:
-----------------------------------------------------------

Hi, [~szetszwo]. 
Leader sends two appendEntries requests to follower in a row.  For the first 
appendEntries request, the index range is {*}[104,114]{*}. While for the second 
appendEntries request, the index range is {*}[105,115]{*}. Normally, the index 
in the second appendEntries request should start at 115, but why start at 105? 
This is because before the leader constructing the second appendEntries 
request, the reply to previous requests returned by the follower is 
'INCONSISTENCY', causing the leader to update the nextIndex to {*}105{*}. 
!output.png!


was (Author: JIRAUSER308993):
Hi, [~szetszwo]. 
Leader sends two appendEntries requests to follower in a row.  For the first 
appendEntries request, the index range is {*}[104,114]{*}. While for the second 
appendEntries request, the index range is {*}[105,115]{*}. Normally, the index 
in the second appendEntries request should start at 115, but why start at 105? 
This is before the leader constructing the second appendEntries request, the 
reply to previous requests returned by the follower is 'INCONSISTENCY', causing 
the leader to update the nextIndex to {*}105{*}. 
!output.png!

> Follower Fails to Append Entries Due to Index Validation Race Condition in 
> NavigableIndices
> -------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2278
>                 URL: https://issues.apache.org/jira/browse/RATIS-2278
>             Project: Ratis
>          Issue Type: Bug
>          Components: Leader
>            Reporter: Xinhao GU
>            Assignee: Xinhao GU
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: image-2025-04-06-22-56-34-207.png, 
> image-2025-04-12-21-39-10-485.png, image.png, output.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> h2. *Problem Description*
> When a follower receives overlapping {{appendEntries}} requests from the 
> leader, it may throw an {{IllegalStateException}} during log index 
> validation. Specifically, the {{NavigableIndices}} class fails to handle 
> concurrent log appends properly, causing the follower to incorrectly reject 
> valid entries with an "index mismatch" error.
> h2. *Steps to Reproduce*
>  # {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2 
> followers.
>  # {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}} 
> requests from the leader to a follower.
>  # {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and 
> {{{}thread-2{}}}) process {{appendEntries}} requests simultaneously:
>  ** {{thread-1}} appends logs with indices {{16093–16185}} and enters 
> {{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}.
>  ** {{thread-2}} attempts to append logs starting at index {{16093}} (from a 
> new request) while {{thread-1}} is still processing.
>  # {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects 
> a gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and 
> {{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error.
> h2. *Expected Behavior*
> Concurrent {{appendEntries}} requests should be serialized or validated 
> atomically to prevent index gaps. Followers should accept logs as long as 
> they pass the initial {{ConsecutiveIndices.convert()}} check.
> h2. *Actual Behavior*
>  
> {code:java}
> 2025-03-04 05:44:14,946 [7-server-thread2] ERROR 
> o.a.r.s.i.RaftServerImpl:1488 - 7@group-000200000000: Failed appendEntries 
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing? 
> false,entries: size=93, first=(t:4, i:16093), 
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
>  id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER, 
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old 
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed 
> == 16093 at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at 
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> 2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN 
> o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request 
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing? 
> false,entries: size=93, first=(t:4, i:16093), 
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
>  id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER, 
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old 
> java.util.concurrent.CompletionException: java.io.IOException: 
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed 
> == 16093
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex: 
> expected == 16186 but computed == 16093
> at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> ... 5 common frames omitted
> Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but 
> computed == 16093
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at 
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> ... 7 common frames omitted*{code}
> *!image-2025-04-06-22-56-34-207.png|width=1143,height=288!*
>  
> h2. *Root Cause Analysis*
>  # {*}Thread Race Condition{*}:
> The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}} 
> and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled. 
> If two threads interleave execution (e.g., {{thread-1}} appends logs but 
> hasn’t cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}} 
> starts a new append), the index continuity check in {{append()}} fails 
> erroneously.
>  # {*}Redundant Validation{*}:
> The {{append()}} method revalidates log continuity using the {{{}map{}}}’s 
> last entry, even though {{ConsecutiveIndices.convert()}} already ensures 
> intra-request index continuity. This creates unnecessary 
> concurrency-sensitive checks.
> h2. *Proposed Solutions*
> h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}}
> Modify {{NavigableIndices.append()}} to skip the startIndex check, relying 
> solely on {{convert()}} for validation.
> {*}Justification{*}:
>  * {{convert()}} already ensures all entries in a single request are 
> consecutive.
>  * {{removeExisting()}} guarantees cleanup after appending, making 
> cross-request checks redundant and race-prone.
> h3. {*}Option 2{*}: Global Lock for {{NavigableIndices'map}}
> Add a global lock for map to serialize all {{append()}} and 
> {{removeExisting()}} operations. However, this could introduce performance 
> bottlenecks.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to