[
https://issues.apache.org/jira/browse/RATIS-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944010#comment-17944010
]
Xinhao GU commented on RATIS-2278:
----------------------------------
Hi, [~szetszwo]. Because the "INCONSISTENCY" is not the reply of "AppendEntries
request [104,114]", but previous request.
> Follower Fails to Append Entries Due to Index Validation Race Condition in
> NavigableIndices
> -------------------------------------------------------------------------------------------
>
> Key: RATIS-2278
> URL: https://issues.apache.org/jira/browse/RATIS-2278
> Project: Ratis
> Issue Type: Bug
> Components: Follower
> Reporter: Xinhao GU
> Assignee: Xinhao GU
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2025-04-06-22-56-34-207.png,
> image-2025-04-12-21-39-10-485.png, image.png, output.png
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> h2. *Problem Description*
> When a follower receives overlapping {{appendEntries}} requests from the
> leader, it may throw an {{IllegalStateException}} during log index
> validation. Specifically, the {{NavigableIndices}} class fails to handle
> concurrent log appends properly, causing the follower to incorrectly reject
> valid entries with an "index mismatch" error.
> h2. *Steps to Reproduce*
> # {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2
> followers.
> # {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}}
> requests from the leader to a follower.
> # {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and
> {{{}thread-2{}}}) process {{appendEntries}} requests simultaneously:
> ** {{thread-1}} appends logs with indices {{16093–16185}} and enters
> {{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}.
> ** {{thread-2}} attempts to append logs starting at index {{16093}} (from a
> new request) while {{thread-1}} is still processing.
> # {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects
> a gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and
> {{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error.
> h2. *Expected Behavior*
> Concurrent {{appendEntries}} requests should be serialized or validated
> atomically to prevent index gaps. Followers should accept logs as long as
> they pass the initial {{ConsecutiveIndices.convert()}} check.
> h2. *Actual Behavior*
>
> {code:java}
> 2025-03-04 05:44:14,946 [7-server-thread2] ERROR
> o.a.r.s.i.RaftServerImpl:1488 - 7@group-000200000000: Failed appendEntries
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
> false,entries: size=93, first=(t:4, i:16093),
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
> id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed
> == 16093 at
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> 2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN
> o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
> false,entries: size=93, first=(t:4, i:16093),
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
> id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
> java.util.concurrent.CompletionException: java.io.IOException:
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed
> == 16093
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex:
> expected == 16186 but computed == 16093
> at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> ... 5 common frames omitted
> Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but
> computed == 16093
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> ... 7 common frames omitted*{code}
> *!image-2025-04-06-22-56-34-207.png|width=1143,height=288!*
>
> h2. *Root Cause Analysis*
> # {*}Thread Race Condition{*}:
> The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}}
> and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled.
> If two threads interleave execution (e.g., {{thread-1}} appends logs but
> hasn’t cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}}
> starts a new append), the index continuity check in {{append()}} fails
> erroneously.
> # {*}Redundant Validation{*}:
> The {{append()}} method revalidates log continuity using the {{{}map{}}}’s
> last entry, even though {{ConsecutiveIndices.convert()}} already ensures
> intra-request index continuity. This creates unnecessary
> concurrency-sensitive checks.
> h2. *Proposed Solutions*
> h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}}
> Modify {{NavigableIndices.append()}} to skip the startIndex check, relying
> solely on {{convert()}} for validation.
> {*}Justification{*}:
> * {{convert()}} already ensures all entries in a single request are
> consecutive.
> * {{removeExisting()}} guarantees cleanup after appending, making
> cross-request checks redundant and race-prone.
> h3. {*}Option 2{*}: Global Lock for {{NavigableIndices'map}}
> Add a global lock for map to serialize all {{append()}} and
> {{removeExisting()}} operations. However, this could introduce performance
> bottlenecks.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)