[ 
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
 ] 

Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 11:00 PM:
-------------------------------------------------------------

{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, the SegmentedRaftLogWorker thread created a new log segment while 
omNode-1-impl-thread1 called close() at about the same time (1 ms apart).

In the code below, close() frees the write buffer first and only then cleans 
up out.  If the worker thread is still writing, the buffer content can be 
corrupted and then flushed to out.
{code}
//SegmentedRaftLogWorker
  void close() {
    ...
    PlatformDependent.freeDirectBuffer(writeBuffer);
    IOUtils.cleanup(LOG, out);
    LOG.info("{} close()", name);
  }
{code}
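
To illustrate the ordering concern, here is a minimal, self-contained sketch. It is not the Ratis code and not a proposed patch: the class BufferedLogWriter and its methods are hypothetical, a plain heap ByteBuffer stands in for the direct buffer, and synchronized is used only as the simplest way to keep write() and close() from overlapping. The point is that close() drains the buffer and closes the stream before it releases the buffer, so nothing can be flushed from memory that has already been freed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a buffered log writer; NOT the Ratis class. */
class BufferedLogWriter implements AutoCloseable {
  private final OutputStream out;
  private ByteBuffer writeBuffer; // set to null once released

  BufferedLogWriter(OutputStream out, int capacity) {
    this.out = out;
    this.writeBuffer = ByteBuffer.allocate(capacity);
  }

  /** Buffers data; synchronized so it cannot interleave with close(). */
  synchronized void write(byte[] data) throws IOException {
    if (writeBuffer == null) {
      throw new IOException("writer is closed");
    }
    if (data.length > writeBuffer.remaining()) {
      flush(); // make room by draining what is already buffered
    }
    if (data.length > writeBuffer.capacity()) {
      out.write(data); // too large to buffer: write through
    } else {
      writeBuffer.put(data);
    }
  }

  /** Copies the buffered bytes to the stream and resets the buffer. */
  private void flush() throws IOException {
    writeBuffer.flip();
    byte[] bytes = new byte[writeBuffer.remaining()];
    writeBuffer.get(bytes);
    out.write(bytes);
    writeBuffer.clear();
  }

  /** Drains and closes the stream first; releases the buffer last. */
  @Override
  public synchronized void close() throws IOException {
    if (writeBuffer == null) {
      return; // already closed
    }
    try (OutputStream o = out) { // the stream is closed even if flush() fails
      flush();
    } finally {
      writeBuffer = null; // release only after the last flush
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    BufferedLogWriter writer = new BufferedLogWriter(sink, 8);
    writer.write("entry-1".getBytes(StandardCharsets.UTF_8));
    writer.write("entry-2".getBytes(StandardCharsets.UTF_8));
    writer.close();
    // prints "entry-1entry-2"
    System.out.println(new String(sink.toByteArray(), StandardCharsets.UTF_8));
  }
}
```

In this sketch a write() that arrives after close() is rejected with an IOException instead of touching a released buffer. Whether the real worker should use a lock, or instead wait for the worker thread to finish before freeing the buffer, is a separate design question that the sketch does not settle.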



was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, the SegmentedRaftLogWorker thread created a new log segment while 
omNode-1-impl-thread1 called close() at about the same time (1 ms apart).

In the code below, close() frees the write buffer first and only then cleans 
up out.  If the worker thread is still writing, the buffer content can be 
corrupted and then flushed to out.  It is a recent change by RATIS-2065.
{code}
//SegmentedRaftLogWorker
  void close() {
    ...
    PlatformDependent.freeDirectBuffer(writeBuffer);
    IOUtils.cleanup(LOG, out);
    LOG.info("{} close()", name);
  }
{code}


> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> ----------------------------------------------------------------------
>
>                 Key: HDDS-11352
>                 URL: https://issues.apache.org/jira/browse/HDDS-11352
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Manager
>            Reporter: Ethan Rose
>            Priority: Critical
>         Attachments: it-om.zip
>
>
> Failure observed in [this 
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] 
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be 
> specific to that test in particular.
> {code}
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> -------------------------------------------------------------------------------
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown  Time elapsed: 18.461 s  <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
>       at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>       at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>       at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
>       at java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
>       at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>       at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>       at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
>       at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: Failed to initRaftLog.
>       at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171)
>       at org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131)
>       at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63)
>       at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148)
>       at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385)
>       at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
>       ... 4 more
> Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C.
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131)
>       at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
>       at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:172)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:428)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:258)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:231)
>       at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:273)
>       at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:194)
>       at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:168)
>       ... 9 more
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.testListVolumes  Time elapsed: 121.075 s  <<< ERROR!
> {code}



