[ 
https://issues.apache.org/jira/browse/RATIS-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997055#comment-16997055
 ] 

Lokesh Jain edited comment on RATIS-762 at 12/16/19 1:21 PM:
-------------------------------------------------------------

[~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the 
Jira and the reviews! I have committed the patch to master branch.


was (Author: ljain):
[~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the 
jira! I have committed the patch to master branch.

> Changing server's log corruption policy can lead to assertion error
> -------------------------------------------------------------------
>
>                 Key: RATIS-762
>                 URL: https://issues.apache.org/jira/browse/RATIS-762
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Istvan Fajth
>            Assignee: Tsz-wo Sze
>            Priority: Major
>             Fix For: 0.5.0
>
>         Attachments: r762_20191203.patch, r762_20191213.patch
>
>
> After restarting the cluster I have ran into the exception discussed and 
> partly fixed by RATIS-677.
> The exception:
> {code}
> ERROR 
> org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: 
> Unable to communicate to SCM server at <scm_host>:9861 for past X seconds.
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is Y but read checksum is 00000000.
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> The code contains the new option about CorruptionPolicy, however HDDS 
> DataNode does not set this property for the initialised Raft server I will 
> create a separate JIRA for that.
> After I modified the code, and specified the 
> {{raft.server.log.corruption.policy}} as {{WARN_AND_RETURN}}, the code 
> properly printed a warning as:
> {code}
> WARN org.apache.ratis.server.raftlog.segmented.LogSegment: Failed to read 
> segment file 
> /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189
>  (start=48957, end=49189, isOpen? false): only 127 entries read successfully
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is Y but read checksum is 00000000.
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> But about 1.6 seconds later the system throws the following exception, 
> writing the following log lines in between the warning and the exception:
> {code}
> INFO org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 
> 127 entries from segment file 
> /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189
> INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Attempting 
> to start container services.
> INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Background 
> container scanner has been disabled.
> INFO 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  Starting XceiverServerRatis ee1c9457-fb78-4a58-8b8d-1a39abee88ff at port 9858
> ERROR 
> org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: 
> Unable to communicate to SCM server at <scm_host>:9861 for past X seconds.
> java.io.IOException: java.lang.IllegalStateException
>         at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>         at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:422)
>         at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215)
>         at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110)
>         at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException
>         at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:36)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:156)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         ... 3 more
> {code}
> So changing the policy effectively just pushes the issue further down in the 
> LogSegment.load code, where there is an assertion that we were able to read 
> until the end, which is not true if there is a checksum error in the meantime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to