[ https://issues.apache.org/jira/browse/RATIS-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997055#comment-16997055 ]
Lokesh Jain edited comment on RATIS-762 at 12/16/19 1:21 PM: ------------------------------------------------------------- [~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the Jira and the reviews! I have committed the patch to master branch. was (Author: ljain): [~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the jira! I have committed the patch to master branch. > Changing server's log corruption policy can lead to assertion error > ------------------------------------------------------------------- > > Key: RATIS-762 > URL: https://issues.apache.org/jira/browse/RATIS-762 > Project: Ratis > Issue Type: Bug > Components: server > Reporter: Istvan Fajth > Assignee: Tsz-wo Sze > Priority: Major > Fix For: 0.5.0 > > Attachments: r762_20191203.patch, r762_20191213.patch > > > After restarting the cluster I have ran into the exception discussed and > partly fixed by RATIS-677. > The exception: > {code} > ERROR > org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: > Unable to communicate to SCM server at <scm_host>:9861 for past X seconds. > org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated > checksum is Y but read checksum is 00000000. > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221) > at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247) > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191) > at > org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > The code contains the new option about CorruptionPolicy, however HDDS > DataNode does not set this property for the initialised Raft server I will > create a separate JIRA for that. > After I modified the code, and specified the > {{raft.server.log.corruption.policy}} as {{WARN_AND_RETURN}}, the code > properly printed a warning as: > {code} > WARN org.apache.ratis.server.raftlog.segmented.LogSegment: Failed to read > segment file > /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189 > (start=48957, end=49189, isOpen? false): only 127 entries read successfully > org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated > checksum is Y but read checksum is 00000000. > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221) > at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247) > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191) > at > org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > But about 1.6 seconds later the system throws the following exception, > writing the following log lines in between the warning and the exception: > {code} > INFO org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read > 127 entries from segment file > /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189 > INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Attempting > to start container services. > INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Background > container scanner has been disabled. > INFO > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: > Starting XceiverServerRatis ee1c9457-fb78-4a58-8b8d-1a39abee88ff at port 9858 > ERROR > org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: > Unable to communicate to SCM server at <scm_host>:9861 for past X seconds. > java.io.IOException: java.lang.IllegalStateException > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54) > at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61) > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70) > at > org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284) > at > org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:422) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalStateException > at > org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:36) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:156) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221) > at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247) > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191) > at > org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > ... 3 more > {code} > So changing the policy effectively just pushes the issue further down in the > LogSegment.load code, where there is an assertion that we were able to read > until the end, which is not true if there is a checksum error in the meantime. -- This message was sent by Atlassian Jira (v8.3.4#803005)