[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931721#comment-16931721
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
-------------------------------------------

> ... If we ignore the exception while reading a segment file wouldn't that 
> make the log segments inconsistent? ...

[~ljain], the idea is to stop reading the log once an exception has been 
thrown, instead of stopping the server.  The log segment file is corrupted, 
and we won't be able to correct it.  The log segment in memory will be 
truncated at the point where the exception occurred.
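
A minimal sketch of that idea, for illustration only.  It is not the code in 
the attached patch and does not use the Ratis on-disk format: the class 
TruncatingLogLoader, its methods, the nested ChecksumException, and the entry 
layout [length][payload][crc32] are all hypothetical.  It only shows a loader 
that keeps the entries read before the first checksum failure and returns 
normally instead of propagating the exception.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical sketch; not the Ratis implementation. */
final class TruncatingLogLoader {
  /** Hypothetical stand-in for org.apache.ratis.protocol.ChecksumException. */
  static class ChecksumException extends Exception {
    ChecksumException(String message) {
      super(message);
    }
  }

  /** Assumed layout for this sketch: [length:int][payload][crc32:int]. */
  static byte[] encode(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(4 + payload.length + 4)
        .putInt(payload.length)
        .put(payload)
        .putInt((int) crc.getValue())
        .array();
  }

  /** Decodes one entry, or throws if the entry is truncated or corrupt. */
  static byte[] decodeEntry(ByteBuffer in) throws ChecksumException {
    if (in.remaining() < 4) {
      throw new ChecksumException("Truncated entry header");
    }
    int length = in.getInt();
    if (length < 0 || (long) length + 4 > in.remaining()) {
      throw new ChecksumException("Invalid entry length " + length);
    }
    byte[] payload = new byte[length];
    in.get(payload);
    int read = in.getInt();
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    int calculated = (int) crc.getValue();
    if (calculated != read) {
      throw new ChecksumException("LogEntry is corrupt. Calculated checksum is "
          + calculated + " but read checksum " + read);
    }
    return payload;
  }

  /**
   * Reads entries until the end of the segment or the first corrupt entry.
   * On corruption, reading stops and the entries decoded so far are returned,
   * so the in-memory segment ends just before the bad entry.
   */
  static List<byte[]> loadSegment(byte[] segment) {
    List<byte[]> entries = new ArrayList<>();
    ByteBuffer in = ByteBuffer.wrap(segment);
    try {
      while (in.hasRemaining()) {
        entries.add(decodeEntry(in));
      }
    } catch (ChecksumException e) {
      // Stop reading here; the caller continues with the entries kept so far.
      System.err.println("Stopped reading after " + entries.size()
          + " entries: " + e.getMessage());
    }
    return entries;
  }
}
```

In this sketch, a segment whose second entry has a zeroed checksum yields a 
list containing only the first entry; everything from the corrupt entry 
onward is dropped from memory, and the file itself is left untouched.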

> Logentry marked corrupt due to ChecksumException
> ------------------------------------------------
>
>                 Key: RATIS-677
>                 URL: https://issues.apache.org/jira/browse/RATIS-677
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Sammi Chen
>            Assignee: Tsz Wo Nicholas Sze
>            Priority: Blocker
>         Attachments: r677_20190913.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear on the 
> SCM side.
>  
> Checked the two failed nodes; the datanode process is still running. In the 
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO       - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO       - Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO       - Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO       - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR      - Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated checksum is -134141393 but read checksum 0
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:110)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
