[
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
]
Tsz-wo Sze commented on HDDS-11352:
-----------------------------------
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close()
in two different threads at about the same time.
I checked the code below: it frees the write buffer first and only then closes
the output. Since the worker thread may still be using the buffer at that
point, the buffer content can be corrupted. This is a recent change from
RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
}
{code}
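To illustrate the ordering concern, here is a minimal sketch of a close() that drains and closes the output before giving up the buffer, and that cannot interleave with a concurrent write. It is not the Ratis code and not a proposed patch: SafeCloseWriter, its 64-byte buffer and the use of a synchronized block are all assumptions made for illustration, and dropping the reference stands in for PlatformDependent.freeDirectBuffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a buffered log writer; not the Ratis implementation. */
final class SafeCloseWriter implements AutoCloseable {
  private final OutputStream out;
  // Becomes null once released, so late writers fail fast instead of touching freed memory.
  private ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64);

  SafeCloseWriter(OutputStream out) {
    this.out = out;
  }

  /** Buffers data; synchronized so it cannot run concurrently with close(). */
  synchronized void write(byte[] data) throws IOException {
    if (writeBuffer == null) {
      throw new IOException("writer is closed");
    }
    int off = 0;
    while (off < data.length) {
      if (!writeBuffer.hasRemaining()) {
        flush();
      }
      int n = Math.min(writeBuffer.remaining(), data.length - off);
      writeBuffer.put(data, off, n);
      off += n;
    }
  }

  /** Copies buffered bytes to the output; callers must hold the lock. */
  private void flush() throws IOException {
    writeBuffer.flip();
    byte[] chunk = new byte[writeBuffer.remaining()];
    writeBuffer.get(chunk);
    out.write(chunk);
    writeBuffer.clear();
  }

  @Override
  public synchronized void close() throws IOException {
    if (writeBuffer == null) {
      return; // already closed
    }
    try {
      flush();      // 1. drain whatever is still buffered
      out.close();  // 2. close the output
    } finally {
      writeBuffer = null; // 3. release the buffer last
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    SafeCloseWriter writer = new SafeCloseWriter(sink);
    writer.write(new byte[150]);
    writer.close();
    System.out.println(sink.size() + " bytes reached the output");
  }
}
```

The point of the sketch is only the order inside close(): the buffer is released after the last flush and after the output is closed, and a write that arrives later is rejected rather than allowed to touch a released buffer.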
> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> ----------------------------------------------------------------------
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Ethan Rose
> Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567]
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be
> specific to that test in particular.
> {code}
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> -------------------------------------------------------------------------------
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown Time elapsed: 18.461 s <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
> at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
> at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
> at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
> at java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
> at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
> at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
> at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171)
> at org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131)
> at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63)
> at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148)
> at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385)
> at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
> ... 4 more
> Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C.
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131)
> at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
> at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:172)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:428)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:258)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:231)
> at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:273)
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:194)
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:168)
> ... 9 more
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.testListVolumes Time elapsed: 121.075 s <<< ERROR!
> {code}