[ 
https://issues.apache.org/jira/browse/RATIS-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2059.
-------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

The pull request is now merged. Thanks, [~duongnguyen]!

> Missing reference count when putting log entries to cache on follower
> ---------------------------------------------------------------------
>
>                 Key: RATIS-2059
>                 URL: https://issues.apache.org/jira/browse/RATIS-2059
>             Project: Ratis
>          Issue Type: Sub-task
>    Affects Versions: 3.1.0
>            Reporter: Duong
>            Assignee: Duong
>            Priority: Major
>             Fix For: 3.1.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
>  
> Found this issue when running Ozone integ test with zero-copy. Ref CI: 
> [https://github.com/duongkame/ozone/actions/runs/8655610655/job/23739252086]
> Problem summary: on the follower side, the LogEntry's clientId buffer is 
> reset (to all zero) while the log entry is kept in RaftLog cache. This cause 
> an exception.
> {code:java}
> org.apache.ratis.thirdparty.com.google.common.util.concurrent.UncheckedExecutionException:
>  java.lang.IllegalStateException: Failed to create ClientId: UUID 
> 00000000-0000-0000-0000-000000000000 is reserved.
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2087)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache.get(LocalCache.java:4019)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4933)
>     at org.apache.ratis.protocol.RaftId$Factory.valueOf(RaftId.java:64)
>     at org.apache.ratis.protocol.RaftId$Factory.valueOf(RaftId.java:71)
>     at org.apache.ratis.protocol.ClientId.valueOf(ClientId.java:45)
>     at 
> org.apache.ratis.protocol.ClientInvocationId.valueOf(ClientInvocationId.java:41)
>     at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1825)
>     at 
> org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:259)
>     at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: Failed to create ClientId: UUID 
> 00000000-0000-0000-0000-000000000000 is reserved.
>     at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
>     at org.apache.ratis.protocol.RaftId.<init>(RaftId.java:91)
>     at org.apache.ratis.protocol.ClientId.<init>(ClientId.java:53)
>     at org.apache.ratis.protocol.ClientId.<init>(ClientId.java:28)
>     at org.apache.ratis.protocol.ClientId$1.newInstance(ClientId.java:32)
>     at org.apache.ratis.protocol.ClientId$1.newInstance(ClientId.java:29)
>     at 
> org.apache.ratis.protocol.RaftId$Factory.lambda$valueOf$0(RaftId.java:64)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4938)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3576)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2318)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2191)
>     at 
> org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2081)
>  {code}
>  
> From RATIS-1979, when stateMachineCacheEnabled, the log entry is copied and 
> decoupled from the original log entry reference counter.
> {code:java}
>      final LogEntryProto removedStateMachineData = 
> LogProtoUtils.removeStateMachineData(entry);
>     ....
>      if (stateMachineCachingEnabled) {
>         // The stateMachineData will be cached inside the StateMachine itself.
>         cache.appendEntry(LogSegment.Op.WRITE_CACHE_WITH_STATE_MACHINE_CACHE,
>             ReferenceCountedObject.wrap(removedStateMachineData));
>       } else {
>         
> cache.appendEntry(LogSegment.Op.WRITE_CACHE_WITHOUT_STATE_MACHINE_CACHE, 
> entryRef
>         );
>       }
> {code}
> The above implementation is problematic because in some cases, 
> *removedStateMachineData* is not a copy but the original entry itself. 
>  # When the entry is a configuration or a metadata entry.
>  # When the entry is a *stateMachine data* and is replicated from leader to 
> follower. Somehow, the state machine data ends up in 
> *LogEntryProto.stateMachineLogEntry.logData* (and not 
> {*}LogEntryProto.stateMachineLogEntry.stateMachineEntry.stateMachineData{*}). 
> This makes the logic of removeStateMachineData completely skip the removal of 
> state machine data.
>  #2 is very concerning and I'm not sure it is a regression from another 
> change. It makes it inconsistent that state machine data is available 
> applyTransaction for followers (not the case for leader).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to