[
https://issues.apache.org/jira/browse/RATIS-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved RATIS-2059.
-------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
The pull request is now merged. Thanks, [~duongnguyen]!
> Missing reference count when putting log entries to cache on follower
> ---------------------------------------------------------------------
>
> Key: RATIS-2059
> URL: https://issues.apache.org/jira/browse/RATIS-2059
> Project: Ratis
> Issue Type: Sub-task
> Affects Versions: 3.1.0
> Reporter: Duong
> Assignee: Duong
> Priority: Major
> Fix For: 3.1.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
>
> Found this issue when running Ozone integ test with zero-copy. Ref CI:
> [https://github.com/duongkame/ozone/actions/runs/8655610655/job/23739252086]
> Problem summary: on the follower side, the LogEntry's clientId buffer is
> reset (to all zero) while the log entry is kept in RaftLog cache. This cause
> an exception.
> {code:java}
> org.apache.ratis.thirdparty.com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.IllegalStateException: Failed to create ClientId: UUID
> 00000000-0000-0000-0000-000000000000 is reserved.
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2087)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache.get(LocalCache.java:4019)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4933)
> at org.apache.ratis.protocol.RaftId$Factory.valueOf(RaftId.java:64)
> at org.apache.ratis.protocol.RaftId$Factory.valueOf(RaftId.java:71)
> at org.apache.ratis.protocol.ClientId.valueOf(ClientId.java:45)
> at
> org.apache.ratis.protocol.ClientInvocationId.valueOf(ClientInvocationId.java:41)
> at
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1825)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:259)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: Failed to create ClientId: UUID
> 00000000-0000-0000-0000-000000000000 is reserved.
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.protocol.RaftId.<init>(RaftId.java:91)
> at org.apache.ratis.protocol.ClientId.<init>(ClientId.java:53)
> at org.apache.ratis.protocol.ClientId.<init>(ClientId.java:28)
> at org.apache.ratis.protocol.ClientId$1.newInstance(ClientId.java:32)
> at org.apache.ratis.protocol.ClientId$1.newInstance(ClientId.java:29)
> at
> org.apache.ratis.protocol.RaftId$Factory.lambda$valueOf$0(RaftId.java:64)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4938)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3576)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2318)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2191)
> at
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2081)
> {code}
>
> From RATIS-1979, when stateMachineCacheEnabled, the log entry is copied and
> decoupled from the original log entry reference counter.
> {code:java}
> final LogEntryProto removedStateMachineData =
> LogProtoUtils.removeStateMachineData(entry);
> ....
> if (stateMachineCachingEnabled) {
> // The stateMachineData will be cached inside the StateMachine itself.
> cache.appendEntry(LogSegment.Op.WRITE_CACHE_WITH_STATE_MACHINE_CACHE,
> ReferenceCountedObject.wrap(removedStateMachineData));
> } else {
>
> cache.appendEntry(LogSegment.Op.WRITE_CACHE_WITHOUT_STATE_MACHINE_CACHE,
> entryRef
> );
> }
> {code}
> The above implementation is problematic because in some cases,
> *removedStateMachineData* is not a copy but the original entry itself.
> # When the entry is a configuration or a metadata entry.
> # When the entry is a *stateMachine data* and is replicated from leader to
> follower. Somehow, the state machine data ends up in
> *LogEntryProto.stateMachineLogEntry.logData* (and not
> {*}LogEntryProto.stateMachineLogEntry.stateMachineEntry.stateMachineData{*}).
> This makes the logic of removeStateMachineData completely skip the removal of
> state machine data.
> #2 is very concerning and I'm not sure it is a regression from another
> change. It makes it inconsistent that state machine data is available
> applyTransaction for followers (not the case for leader).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)