[ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025848#comment-17025848 ]
Marton Elek commented on RATIS-804:
-----------------------------------
The problematic segment is this (in CacheInvalidationPolicy.java):
{code:java}
if (result.isEmpty()) {
  for (int i = safeIndex; i >= j; i--) {
    LogSegment s = segments.get(i);
    if (s.getStartIndex() > lastAppliedIndex && s.hasCache()) {
      result.add(s);
      break;
    }
  }
}
{code}
This is the last step of the algorithm. The evictImpl:
# First checks which segments are not flushed yet. They must be kept.
# (In case of a follower) Checks which segments are already applied.
# (In case of a follower, when no segments have been selected for removal up to this point) *Removes the segments between the lastAppliedIndex and the localFlushIndex* in the hope that they can be reloaded at any time. They can, but only with locks.
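
The "only with locks" remark can be illustrated with a minimal sketch. This is not the Ratis LogSegment code: GuardedSegment, diskReader and cachedEntry are hypothetical names, and the sketch only assumes that eviction and loading both touch the same cached reference. Holding one lock across "read from disk" and "return the entry" means an eviction cannot slip in between those two steps.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a log segment; NOT the Ratis LogSegment class. */
final class GuardedSegment {
  private final Object lock = new Object();
  // Assumption: stands in for reading the segment file; must return non-null.
  private final Supplier<String> diskReader;
  // null means "evicted" or "not loaded yet".
  private String cachedEntry;

  GuardedSegment(Supplier<String> diskReader) {
    this.diskReader = diskReader;
  }

  /** Drops the cached entry; blocks while a load is in progress. */
  void evictCache() {
    synchronized (lock) {
      cachedEntry = null;
    }
  }

  /** Loads the entry if needed and returns it while still holding the lock. */
  String loadCache() {
    synchronized (lock) {
      if (cachedEntry == null) {
        cachedEntry = diskReader.get();
      }
      // Read under the same lock as the load, so an eviction cannot
      // null the reference between loading and returning.
      return cachedEntry;
    }
  }

  /** Reports whether an entry is currently cached. */
  boolean hasCache() {
    synchronized (lock) {
      return cachedEntry != null;
    }
  }
}
```

With this shape, a call such as new GuardedSegment(() -> "entry").loadCache() returns the loaded value even if another thread calls evictCache() concurrently; the eviction simply takes effect after the load has returned. The cost is that eviction waits for disk reads, which may or may not be acceptable on the append path.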
> Race condition between cache evict and load in LogSegment
> ---------------------------------------------------------
>
> Key: RATIS-804
> URL: https://issues.apache.org/jira/browse/RATIS-804
> Project: Ratis
> Issue Type: Bug
> Reporter: Marton Elek
> Priority: Critical
>
> I am doing some stress testing with Ozone. I start one Datanode in
> FOLLOWER mode and the load generator (Freon) behaves like a LEADER.
> I am sending a huge number of AppendLogEntries to the FOLLOWER without
> any throttling.
> As a result I got an NPE:
> {code:java}
> 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: the StateMachineUpdater hits Throwable
> org.apache.ratis.server.raftlog.RaftLogIOException: java.lang.NullPointerException
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293)
>     at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218)
>     at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at java.util.Objects.requireNonNull(Objects.java:203)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318)
>     ... 4 more
> {code}
> It seems to be a race condition between LogSegment.evictCache() and
> LogSegment.loadCache().
> # StateMachineUpdater tries to update the StateMachine with the next log entry
> # It can't be found in the cache, therefore LogSegment.loadCache() is called
> # LogSegment.LogEntryLoader.load() reads the segment file from the disk
> # After loading, it returns the loaded entry
> If the GRPC thread evicts the cache between steps 3 and 4 (it is possible that
> the log segment is already flushed and can therefore be evicted), an NPE is
> thrown.