[
https://issues.apache.org/jira/browse/RATIS-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xinhao GU updated RATIS-2278:
-----------------------------
Description:
h2. *Problem Description*
When a follower receives overlapping {{appendEntries}} requests from the
leader, it may throw an {{IllegalStateException}} during log index validation.
Specifically, the {{NavigableIndices}} class fails to handle concurrent log
appends properly, causing the follower to incorrectly reject valid entries with
an "index mismatch" error.
h2. *Steps to Reproduce*
# {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2
followers.
# {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}} requests
from the leader to a follower.
# {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and
{{{}thread-2{}}}) process {{appendEntries}} requests simultaneously:
** {{thread-1}} appends logs with indices {{16093–16185}} and enters
{{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}.
** {{thread-2}} attempts to append logs starting at index {{16093}} (from a
new request) while {{thread-1}} is still processing.
# {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects a
gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and
{{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error.
h2. *Expected Behavior*
Concurrent {{appendEntries}} requests should be serialized or validated
atomically to prevent index gaps. Followers should accept logs as long as they
pass the initial {{ConsecutiveIndices.convert()}} check.
h2. *Actual Behavior*
{code:java}
2025-03-04 05:44:14,946 [7-server-thread2] ERROR o.a.r.s.i.RaftServerImpl:1488
- 7@group-000200000000: Failed appendEntries 8->7#73-t49,previous=(t:3,
i:16092),leaderCommit=16184,initializing? false,entries: size=93, first=(t:4,
i:16093),
CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
java.lang.IllegalStateException: startIndex: expected == 16186 but computed ==
16093 at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
at
org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
at
org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN
o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request
8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
false,entries: size=93, first=(t:4, i:16093),
CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
java.util.concurrent.CompletionException: java.io.IOException:
java.lang.IllegalStateException: startIndex: expected == 16186 but computed ==
16093
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex:
expected == 16186 but computed == 16093
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
... 5 common frames omitted
Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but
computed == 16093
at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
at
org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
at
org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
... 7 common frames omitted*{code}
*!image-2025-04-06-22-56-34-207.png|width=1143,height=288!*
h2. *Root Cause Analysis*
# {*}Thread Race Condition{*}:
The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}}
and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled. If
two threads interleave execution (e.g., {{thread-1}} appends logs but hasn’t
cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}} starts a
new append), the index continuity check in {{append()}} fails erroneously.
# {*}Redundant Validation{*}:
The {{append()}} method revalidates log continuity using the {{{}map{}}}’s last
entry, even though {{ConsecutiveIndices.convert()}} already ensures
intra-request index continuity. This creates unnecessary concurrency-sensitive
checks.
h2. *Proposed Solutions*
h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}}
Modify {{NavigableIndices.append()}} to skip the startIndex check, relying
solely on {{convert()}} for validation.
{*}Justification{*}:
* {{convert()}} already ensures all entries in a single request are
consecutive.
* {{removeExisting()}} guarantees cleanup after appending, making
cross-request checks redundant and race-prone.
h3. {*}Option 2{*}: Global Lock for {{NavigableIndices}}
Add a global lock to serialize all {{append()}} and {{removeExisting()}}
operations. However, this could introduce performance bottlenecks.
was:
h2. *Problem Description*
When a follower receives overlapping {{appendEntries}} requests from the
leader, it may throw an {{IllegalStateException}} during log index validation.
Specifically, the {{NavigableIndices}} class fails to handle concurrent log
appends properly, causing the follower to incorrectly reject valid entries with
an "index mismatch" error.
h2. *Steps to Reproduce*
# {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2
followers.
# {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}} requests
from the leader to a follower.
# {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and
{{{}thread-2{}}}) process {{appendEntries}} requests simultaneously:
** {{thread-1}} appends logs with indices {{16093–16185}} and enters
{{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}.
** {{thread-2}} attempts to append logs starting at index {{16093}} (from a
new request) while {{thread-1}} is still processing.
# {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects a
gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and
{{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error.
h2. *Expected Behavior*
Concurrent {{appendEntries}} requests should be serialized or validated
atomically to prevent index gaps. Followers should accept logs as long as they
pass the initial {{ConsecutiveIndices.convert()}} check.
h2. *Actual Behavior*
{code:java}
{code}
*2025-03-04 05:44:14,946 [7-server-thread2] ERROR o.a.r.s.i.RaftServerImpl:1488
- 7@group-000200000000: Failed appendEntries* 8->7#73-t49,previous=(t:3,
i:16092),leaderCommit=16184,initializing? false,entries: size=93, first=(t:4,
i:16093),
CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old:)
java.lang.IllegalStateException: {color:#FF0000}startIndex: expected == 16186
but computed == 16093{color} at
org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
at
org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
at
org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN
o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request
8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
false,entries: size=93, first=(t:4, i:16093),
CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old:)
java.util.concurrent.CompletionException: java.io.IOException:
java.lang.IllegalStateException: startIndex: expected == 16186 but computed ==
16093
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex:
expected == 16186 but computed == 16093
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
... 5 common frames omitted
Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but
computed == 16093
at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
at
org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
at
org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
at
org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
... 7 common frames omitted*
*!image-2025-04-06-22-56-34-207.png|width=1143,height=288!*
h2. *Root Cause Analysis*
# {*}Thread Race Condition{*}:
The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}}
and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled. If
two threads interleave execution (e.g., {{thread-1}} appends logs but hasn’t
cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}} starts a
new append), the index continuity check in {{append()}} fails erroneously.
# {*}Redundant Validation{*}:
The {{append()}} method revalidates log continuity using the {{{}map{}}}’s last
entry, even though {{ConsecutiveIndices.convert()}} already ensures
intra-request index continuity. This creates unnecessary concurrency-sensitive
checks.
h2. *Proposed Solutions*
h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}}
Modify {{NavigableIndices.append()}} to skip the startIndex check, relying
solely on {{convert()}} for validation.
{*}Justification{*}:
* {{convert()}} already ensures all entries in a single request are
consecutive.
* {{removeExisting()}} guarantees cleanup after appending, making
cross-request checks redundant and race-prone.
h3. {*}Option 2{*}: Global Lock for {{NavigableIndices}}
Add a global lock to serialize all {{append()}} and {{removeExisting()}}
operations. However, this could introduce performance bottlenecks.
> Follower Fails to Append Entries Due to Index Validation Race Condition in
> NavigableIndices
> -------------------------------------------------------------------------------------------
>
> Key: RATIS-2278
> URL: https://issues.apache.org/jira/browse/RATIS-2278
> Project: Ratis
> Issue Type: Bug
> Components: gRPC, LogService, server
> Reporter: Xinhao GU
> Assignee: Xinhao GU
> Priority: Major
> Attachments: image-2025-04-06-22-56-34-207.png
>
>
> h2. *Problem Description*
> When a follower receives overlapping {{appendEntries}} requests from the
> leader, it may throw an {{IllegalStateException}} during log index
> validation. Specifically, the {{NavigableIndices}} class fails to handle
> concurrent log appends properly, causing the follower to incorrectly reject
> valid entries with an "index mismatch" error.
> h2. *Steps to Reproduce*
> # {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2
> followers.
> # {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}}
> requests from the leader to a follower.
> # {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and
> {{{}thread-2{}}}) process {{appendEntries}} requests simultaneously:
> ** {{thread-1}} appends logs with indices {{16093–16185}} and enters
> {{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}.
> ** {{thread-2}} attempts to append logs starting at index {{16093}} (from a
> new request) while {{thread-1}} is still processing.
> # {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects
> a gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and
> {{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error.
> h2. *Expected Behavior*
> Concurrent {{appendEntries}} requests should be serialized or validated
> atomically to prevent index gaps. Followers should accept logs as long as
> they pass the initial {{ConsecutiveIndices.convert()}} check.
> h2. *Actual Behavior*
>
> {code:java}
> 2025-03-04 05:44:14,946 [7-server-thread2] ERROR
> o.a.r.s.i.RaftServerImpl:1488 - 7@group-000200000000: Failed appendEntries
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
> false,entries: size=93, first=(t:4, i:16093),
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
> id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed
> == 16093 at
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> 2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN
> o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request
> 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing?
> false,entries: size=93, first=(t:4, i:16093),
> CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER,
> id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER,
> id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old
> java.util.concurrent.CompletionException: java.io.IOException:
> java.lang.IllegalStateException: startIndex: expected == 16186 but computed
> == 16093
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex:
> expected == 16186 but computed == 16093
> at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> ... 5 common frames omitted
> Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but
> computed == 16093
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87)
> at
> org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596)
> at
> org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486)
> ... 7 common frames omitted*{code}
> *!image-2025-04-06-22-56-34-207.png|width=1143,height=288!*
>
> h2. *Root Cause Analysis*
> # {*}Thread Race Condition{*}:
> The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}}
> and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled.
> If two threads interleave execution (e.g., {{thread-1}} appends logs but
> hasn’t cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}}
> starts a new append), the index continuity check in {{append()}} fails
> erroneously.
> # {*}Redundant Validation{*}:
> The {{append()}} method revalidates log continuity using the {{{}map{}}}’s
> last entry, even though {{ConsecutiveIndices.convert()}} already ensures
> intra-request index continuity. This creates unnecessary
> concurrency-sensitive checks.
> h2. *Proposed Solutions*
> h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}}
> Modify {{NavigableIndices.append()}} to skip the startIndex check, relying
> solely on {{convert()}} for validation.
> {*}Justification{*}:
> * {{convert()}} already ensures all entries in a single request are
> consecutive.
> * {{removeExisting()}} guarantees cleanup after appending, making
> cross-request checks redundant and race-prone.
> h3. {*}Option 2{*}: Global Lock for {{NavigableIndices}}
> Add a global lock to serialize all {{append()}} and {{removeExisting()}}
> operations. However, this could introduce performance bottlenecks.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)