[
https://issues.apache.org/jira/browse/HDDS-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083949#comment-17083949
]
Marton Elek commented on HDDS-3257:
-----------------------------------
We analyzed a few of the failures with [~shashikant] and [~ljain]. It seems to
be an asymmetric communication problem between leader and follower:
Usually the last leader is elected with 1 vote (instead of 2):
{code}
2020-04-07 05:39:52,800
[10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-LeaderElection7] INFO
impl.LeaderElection (LeaderElection.java:logAndReturn(61)) -
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-LeaderElection7:
Election PASSED; received 1 response(s)
[10efa0c0-b6d1-45a1-855e-ef1ad741a71c<-bd7acc78-d58c-493d-9f90-aa900375b793#0:OK-t2]
and 0 exception(s);
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0:t2, leader=null,
voted=10efa0c0-b6d1-45a1-855e-ef1ad741a71c,
raftlog=10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-SegmentedRaftLog:OPENED:c-1,f-1,i0,
conf=-1: [d6a790ea-9667-4c35-b496-e28617be47e4:172.17.0.2:41539,
10efa0c0-b6d1-45a1-855e-ef1ad741a71c:172.17.0.2:45457,
bd7acc78-d58c-493d-9f90-aa900375b793:172.17.0.2:36205], old=null
{code}
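For context, one response is enough for the election to pass in this three-peer group, because a Raft candidate also counts its own vote. A minimal sketch of that arithmetic, assuming standard Raft majority rules (the function names are illustrative and are not taken from Ratis):

```python
def votes_needed(group_size: int) -> int:
    """Smallest number of votes that forms a strict majority of the group."""
    return group_size // 2 + 1


def election_passes(group_size: int, ok_responses: int) -> bool:
    """Return True if the candidate wins with the given number of OK responses.

    Assumption: the candidate always votes for itself, so its own vote is
    added to the OK responses received from the other peers.
    """
    return ok_responses + 1 >= votes_needed(group_size)


GROUP_SIZE = 3  # the conf in the log line above lists three peers

for ok in (0, 1, 2):
    print(f"{ok} OK response(s): passes={election_passes(GROUP_SIZE, ok)}")
```

So the election itself is valid; what stands out is that the third peer (d6a790ea-...) did not answer the vote request at all.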
But the follower that didn't vote does receive the message from the leader:
{code}
2020-04-07 05:39:52,861 [grpc-default-executor-0] INFO impl.RaftServerImpl
(ServerState.java:setLeader(255)) -
d6a790ea-9667-4c35-b496-e28617be47e4@group-9C1A66D474A0: change Leader from
null to 10efa0c0-b6d1-45a1-855e-ef1ad741a71c at term 2 for appendEntries,
leader elected after 10407ms
{code}
And after the 1-minute timeout, multiple appendEntries requests time out:
{code}
2020-04-07 05:41:05,247
[java.util.concurrent.ThreadPoolExecutor$Worker@228750d3[State = -1, empty
queue]] WARN server.GrpcLogAppender
(GrpcLogAppender.java:timeoutAppendRequest(212)) -
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0->d6a790ea-9667-4c35-b496-e28617be47e4-GrpcLogAppender:
appendEntries Timeout,
request=AppendEntriesRequest:cid=39,entriesCount=1,lastEntry=(t:2, i:34)
{code}
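The gap between the election and this timeout can be read off the log timestamps. A small sketch, with the two timestamps copied from the excerpts above; the helper name is hypothetical:

```python
from datetime import datetime

# Timestamp layout used by the log lines above (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


election_passed = "2020-04-07 05:39:52,800"  # "Election PASSED" line
append_timeout = "2020-04-07 05:41:05,247"   # "appendEntries Timeout" line

gap = seconds_between(election_passed, append_timeout)
print(f"{gap:.3f} s between election and appendEntries timeout")
```

For these two lines the gap is a little over 72 seconds.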
> Intermittent timeout in integration tests
> -----------------------------------------
>
> Key: HDDS-3257
> URL: https://issues.apache.org/jira/browse/HDDS-3257
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Attila Doroszlai
> Assignee: Shashikant Banerjee
> Priority: Critical
> Attachments:
> org.apache.hadoop.fs.ozone.contract.ITestOzoneContractMkdir-output.txt,
> org.apache.hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures-output.txt,
> org.apache.hadoop.ozone.freon.TestOzoneClientKeyGenerator-output.txt,
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator-output.txt
>
>
> Even after the changes done in HDDS-3086, some integration tests (especially
> in it-freon) are intermittently timing out.