[
https://issues.apache.org/jira/browse/HDDS-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen updated HDDS-2118:
-----------------------------
Description:
Steps:
1. Ran Teragen and generated a few GB of data on a 4-datanode cluster.
2. Stopped the datanodes through ./stop-ozone.sh.
3. Changed the ozone binaries.
4. Started the cluster through ./start-ozone.sh.
5. Two datanodes registered with SCM. The other two failed to appear on the SCM side.
I checked the two failed nodes; the datanode process is still running on each.
In the log file, I found many of the following errors:
2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Attempting to start container services.
2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Background container scanner has been disabled.
2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO - Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR - Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 seconds.
org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated checksum is -134141393 but read checksum 0
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:110)
at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
> Datanode fail to start after stop
> ---------------------------------
>
> Key: HDDS-2118
> URL: https://issues.apache.org/jira/browse/HDDS-2118
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Sammi Chen
> Priority: Major
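For readers unfamiliar with the exception quoted above: its message indicates that the log reader recomputes a checksum over each entry and compares it with the value stored next to the entry. The Python sketch below illustrates that kind of check and shows one way a stored value of 0 can appear next to a negative calculated value, namely when the trailing checksum bytes read back as zeros. It is only a sketch under assumptions: the record layout (4-byte length, payload, 4-byte checksum), the use of `zlib.crc32`, and the names `encode_entry`, `decode_entry`, `signed32` and `ChecksumError` are hypothetical and do not describe Ratis's actual on-disk format or API. It does not establish the root cause of this issue.

```python
import struct
import zlib


class ChecksumError(Exception):
    """Raised when the stored and recomputed checksums differ."""


def signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as signed, like a Java int."""
    return struct.unpack(">i", struct.pack(">I", value & 0xFFFFFFFF))[0]


def encode_entry(payload: bytes) -> bytes:
    """Hypothetical layout: 4-byte length, payload, 4-byte signed checksum."""
    checksum = signed32(zlib.crc32(payload))
    return struct.pack(">I", len(payload)) + payload + struct.pack(">i", checksum)


def decode_entry(record: bytes) -> bytes:
    """Return the payload, or raise ChecksumError if verification fails."""
    (length,) = struct.unpack_from(">I", record, 0)
    payload = record[4:4 + length]
    (stored,) = struct.unpack_from(">i", record, 4 + length)
    calculated = signed32(zlib.crc32(payload))
    if calculated != stored:
        raise ChecksumError(
            f"LogEntry is corrupt. Calculated checksum is {calculated} "
            f"but read checksum {stored}"
        )
    return payload


if __name__ == "__main__":
    good = encode_entry(b"write chunk")
    assert decode_entry(good) == b"write chunk"

    # Simulate a record whose trailing checksum bytes read back as zeros.
    damaged = good[:-4] + b"\x00" * 4
    try:
        decode_entry(damaged)
    except ChecksumError as err:
        print(err)
```

A 32-bit CRC held in a signed integer prints as a negative number whenever its top bit is set, which is consistent with the -134141393 in the log.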