[
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962159#comment-16962159
]
Marton Elek commented on HDDS-2372:
-----------------------------------
I tried to reproduce it locally with docker-compose. In ContainerStateMachine
I reduced the capacity of the cache:
{code:java}
stateMachineDataCache = CacheBuilder.newBuilder()
    .expireAfterAccess(500, TimeUnit.MILLISECONDS)
    // set the limit on no of cached entries equal to no of max threads
    // executing writeStateMachineData
    .maximumSize(10).build();
{code}
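The effect of such a small maximumSize can be illustrated with a plain size-bounded LRU map (a sketch only, not Guava's actual eviction implementation): once more than 10 entries are written, the oldest ones are evicted, so a later readStateMachineData for an evicted log index misses the cache and has to fall back to the chunk file on disk, which may not exist yet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruEvictionDemo {
  // Minimal size-bounded LRU map, mimicking a maximumSize(10) cache.
  static <K, V> Map<K, V> boundedCache(int maxSize) {
    return new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
      }
    };
  }

  public static void main(String[] args) {
    Map<Long, String> cache = boundedCache(10);
    // Simulate 20 in-flight writeStateMachineData entries keyed by log index.
    for (long index = 1; index <= 20; index++) {
      cache.put(index, "chunk-data-" + index);
    }
    // Indexes 1..10 were evicted; a readStateMachineData for them misses the
    // cache and must read from disk, where the chunk file may not exist yet.
    System.out.println(cache.containsKey(1L));   // false
    System.out.println(cache.containsKey(15L));  // true
  }
}
```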
And added a random wait to readStateMachineData:
{code:java}
private ByteString readStateMachineData(
    ContainerCommandRequestProto requestProto, long term, long index)
    throws IOException {
  if (Math.random() > 0.7) {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
  // ... rest of the method unchanged
{code}
I got a similar, but different error:
{code:java}
-SegmentedRaftLogWorker: created new log segment
/data/metadata/ratis/68c226d2-356c-4eb0-aee2-ce458d4b0095/current/log_inprogress_6872
datanode_3 | 2019-10-29 15:54:10,084 [pool-7-thread-38] ERROR
- Unable to find the chunk file. chunk info :
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] INFO
- Operation: ReadChunk : Trace ID:
b93bcdcdd7fd37c:a3bed642046e9e09:b93bcdcdd7fd37c:1 : Message: Unable to find
the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366,
offset=0, len=1024} : Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] ERROR
- gid group-CE458D4B0095 : ReadStateMachine failed. cmd ReadChunk logIndex
8773 msg : Unable to find the chunk file. chunk info
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
Container Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,086 ERROR raftlog.RaftLog:
06f4231d-30a8-42fd-839e-aeaea7b1aa72@group-CE458D4B0095-SegmentedRaftLog:
Failed readStateMachineData for (t:2, i:8773), STATEMACHINELOGENTRY,
client-BCA58E609475, cid=4367
datanode_3 | java.util.concurrent.ExecutionException:
java.util.concurrent.ExecutionException:
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
Unable to find the chunk file. chunk info
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 |   at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
datanode_3 |   at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
datanode_3 |   at org.apache.ratis.server.raftlog.RaftLog$EntryWithData.getEntry(RaftLog.java:472)
datanode_3 |   at org.apache.ratis.util.DataQueue.pollList(DataQueue.java:134)
datanode_3 |   at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:220)
datanode_3 |   at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:178)
datanode_3 |   at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:121)
datanode_3 |   at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:76)
datanode_3 |   at java.base/java.lang.Thread.run(Thread.java:834)
{code}
And the cluster is stuck in a bad state (couldn't write any more chunks, ever):
{code:java}
datanode_1 | 2019-10-29 15:54:10,099 INFO impl.RaftServerImpl:
6b9ca1af-467f-40c7-a21d-118cb34080b1@group-CE458D4B0095: inconsistency entries.
Reply:06f4231d-30a8-42fd-839e-aeaea7b1aa72<-6b9ca1af-467f-40c7-a21d-118cb34080b1#0:FAIL,INCONSISTENCY,nextIndex:8773,term:2,followerCommit:8768
{code}
Correct me if I am wrong, but
 * I think the write path should work even if the cache is limited or there are
unexpected sleeps
 * If there are any inconsistencies, the raft ring should be healed, or closed
and reopened (but that's an independent issue)
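The first point can be sketched as follows (hypothetical names, not the actual ContainerStateMachine API): a cache miss in the read path should degrade to a disk read of the already-persisted chunk instead of surfacing UNABLE_TO_FIND_CHUNK and poisoning the raft log appender.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a read path that tolerates cache eviction: on a miss it falls
// back to the disk reader instead of failing the whole append.
public class ReadWithFallback {
  private final Map<Long, String> cache = new ConcurrentHashMap<>();
  private final Function<Long, String> diskReader;

  ReadWithFallback(Function<Long, String> diskReader) {
    this.diskReader = diskReader;
  }

  void onWrite(long index, String data) {
    cache.put(index, data);
  }

  // Cache miss -> fall back to the chunk file on disk, don't fail.
  String read(long index) {
    String data = cache.get(index);
    if (data != null) {
      return data;
    }
    return diskReader.apply(index);
  }

  public static void main(String[] args) {
    ReadWithFallback sm = new ReadWithFallback(index -> "disk-chunk-" + index);
    sm.onWrite(5L, "cached-chunk-5");
    System.out.println(sm.read(5L)); // served from the cache
    System.out.println(sm.read(7L)); // cache miss, served from disk fallback
  }
}
```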
> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
> Key: HDDS-2372
> URL: https://issues.apache.org/jira/browse/HDDS-2372
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Marton Elek
> Priority: Critical
>
> Found it on a k8s based test cluster using a simple 3 node cluster and
> HDDS-2327 freon test. After a while the StateMachine become unhealthy after
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException:
> java.util.concurrent.ExecutionException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> java.nio.file.NoSuchFileException:
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
> {code}
> Can be reproduced.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]