[
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
]
Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 10:49 PM:
-------------------------------------------------------------
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called
close() in two different threads at about the same time.
I checked the code below: it frees the buffer first and then cleans up out.
The buffer content can be corrupted and then flushed to out. This is a recent
change from RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
}
{code}
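To illustrate why the ordering matters, here is a minimal, self-contained sketch. It is not the Ratis code and not a proposed patch: BufferedLogWriter, releaseBuffer() and the two close variants are hypothetical names, the buffer is a plain heap ByteBuffer, and "freeing" is simulated by overwriting the bytes (an assumption standing in for freed direct memory being reused). It also runs in a single thread, so it only shows the ordering problem, not the race itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class CloseOrderSketch {
  /** Hypothetical stand-in for a writer that buffers bytes before writing them to out. */
  static final class BufferedLogWriter {
    private final ByteBuffer writeBuffer = ByteBuffer.allocate(16);
    private final OutputStream out;

    BufferedLogWriter(OutputStream out) {
      this.out = out;
    }

    void write(byte[] data) {
      writeBuffer.put(data);
    }

    /** Writes whatever is still buffered to out, then closes out. */
    private void flushAndCloseOut() throws IOException {
      writeBuffer.flip();
      out.write(writeBuffer.array(), 0, writeBuffer.limit());
      writeBuffer.clear();
      out.close();
    }

    /** Simulated free: after release, the memory may hold arbitrary bytes. */
    private void releaseBuffer() {
      Arrays.fill(writeBuffer.array(), (byte) 0xFF);
    }

    /** Same order as the snippet above: free the buffer, then clean up out. */
    void closeFreeFirst() throws IOException {
      releaseBuffer();
      flushAndCloseOut();
    }

    /** Alternative order: clean up out while the buffer is still valid, then free it. */
    void closeOutFirst() throws IOException {
      flushAndCloseOut();
      releaseBuffer();
    }
  }

  /** Buffers four bytes, closes in the chosen order, and returns what reached the sink. */
  public static byte[] run(boolean freeFirst) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    BufferedLogWriter writer = new BufferedLogWriter(sink);
    writer.write(new byte[] {1, 2, 3, 4});
    if (freeFirst) {
      writer.closeFreeFirst();
    } else {
      writer.closeOutFirst();
    }
    return sink.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.println("free first: " + Arrays.toString(run(true)));
    System.out.println("out first:  " + Arrays.toString(run(false)));
  }
}
```

With the free-first order, the bytes that reach the sink are the overwritten ones rather than the bytes that were buffered; with the out-first order, the sink receives 1, 2, 3, 4. In a real log segment the first case would plausibly surface as a checksum mismatch when the segment is read back.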
was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called
close() in two different threads at about the same time.
I checked the code below: it frees the buffer first and then calls close().
The buffer content can be corrupted. This is a recent change from RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
}
{code}
> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> ----------------------------------------------------------------------
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Ethan Rose
> Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567]
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be
> specific to that test in particular.
> {code}
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> -------------------------------------------------------------------------------
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown Time elapsed: 18.461 s <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
> at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
> at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
> at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
> at java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
> at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
> at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
> at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171)
> at org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131)
> at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63)
> at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148)
> at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385)
> at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
> ... 4 more
> Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C.
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131)
> at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
> at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:172)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:428)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:258)
> at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:231)
> at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:273)
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:194)
> at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:168)
> ... 9 more
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.testListVolumes Time elapsed: 121.075 s <<< ERROR!
> {code}