[ 
https://issues.apache.org/jira/browse/HDDS-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083949#comment-17083949
 ] 

Marton Elek commented on HDDS-3257:
-----------------------------------

We analyzed a few of the failures with [~shashikant] and [~ljain]. It seems to 
be an asymmetric communication problem between the leader and a follower:

Usually the last leader is elected with 1 vote (instead of 2):

{code}
2020-04-07 05:39:52,800 
[10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-LeaderElection7] INFO  
impl.LeaderElection (LeaderElection.java:logAndReturn(61)) - 
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-LeaderElection7: 
Election PASSED; received 1 response(s) 
[10efa0c0-b6d1-45a1-855e-ef1ad741a71c<-bd7acc78-d58c-493d-9f90-aa900375b793#0:OK-t2]
 and 0 exception(s); 
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0:t2, leader=null, 
voted=10efa0c0-b6d1-45a1-855e-ef1ad741a71c, 
raftlog=10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0-SegmentedRaftLog:OPENED:c-1,f-1,i0,
 conf=-1: [d6a790ea-9667-4c35-b496-e28617be47e4:172.17.0.2:41539, 
10efa0c0-b6d1-45a1-855e-ef1ad741a71c:172.17.0.2:45457, 
bd7acc78-d58c-493d-9f90-aa900375b793:172.17.0.2:36205], old=null
{code}
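
For context on why a single response is enough to pass: in Raft the candidate 
votes for itself (see voted= in the log above), so in the three-member group 
listed in conf, one OK from a peer already gives 2 of 3 votes. A minimal sketch 
of that arithmetic; the class and method names are hypothetical and are not 
Ratis APIs:

```java
/** Hypothetical helper illustrating Raft majority arithmetic; not part of Ratis. */
final class ElectionQuorum {
    private ElectionQuorum() {}

    /** Number of votes that forms a strict majority of the group. */
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    /** The candidate counts its own vote in addition to the votes granted by peers. */
    static boolean wins(int groupSize, int grantedPeerVotes) {
        return grantedPeerVotes + 1 >= majority(groupSize);
    }

    public static void main(String[] args) {
        // Three peers in conf, one OK response (from bd7acc78-...) as in the log above.
        System.out.println(wins(3, 1)); // true
        // With no peer response the candidate only has its own vote.
        System.out.println(wins(3, 0)); // false
    }
}
```

So the election result itself is valid; what stands out is that the third peer 
(d6a790ea-...) never answered the vote request.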

But the follower that didn't vote does receive the message from the leader:

{code}
2020-04-07 05:39:52,861 [grpc-default-executor-0] INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(255)) - 
d6a790ea-9667-4c35-b496-e28617be47e4@group-9C1A66D474A0: change Leader from 
null to 10efa0c0-b6d1-45a1-855e-ef1ad741a71c at term 2 for appendEntries, 
leader elected after 10407ms
{code}

And after the 1-minute timeout, multiple appendEntries requests time out:

{code}
2020-04-07 05:41:05,247 
[java.util.concurrent.ThreadPoolExecutor$Worker@228750d3[State = -1, empty 
queue]] WARN  server.GrpcLogAppender 
(GrpcLogAppender.java:timeoutAppendRequest(212)) - 
10efa0c0-b6d1-45a1-855e-ef1ad741a71c@group-9C1A66D474A0->d6a790ea-9667-4c35-b496-e28617be47e4-GrpcLogAppender:  
appendEntries Timeout, 
request=AppendEntriesRequest:cid=39,entriesCount=1,lastEntry=(t:2, i:34)
{code}
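
To see how this lines up with the 1-minute timeout, the timestamps of the 
setLeader line and the appendEntries Timeout line above can be subtracted. A 
small sketch, assuming the timestamp layout seen in these excerpts (the class 
and method names are hypothetical):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for comparing timestamps taken from the test output. */
final class LogGap {
    // Layout assumed from the excerpts above, e.g. "2020-04-07 05:39:52,861".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    private LogGap() {}

    /** Milliseconds elapsed between two log timestamps. */
    static long millisBetween(String earlier, String later) {
        LocalDateTime start = LocalDateTime.parse(earlier, LOG_TIME);
        LocalDateTime end = LocalDateTime.parse(later, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Follower accepts the new leader vs. leader logs the appendEntries timeout.
        long gap = millisBetween("2020-04-07 05:39:52,861", "2020-04-07 05:41:05,247");
        System.out.println(gap + " ms"); // 72386 ms
    }
}
```

That is roughly 72 seconds between the follower accepting the leader and the 
leader logging this timeout.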

> Intermittent timeout in integration tests
> -----------------------------------------
>
>                 Key: HDDS-3257
>                 URL: https://issues.apache.org/jira/browse/HDDS-3257
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Attila Doroszlai
>            Assignee: Shashikant Banerjee
>            Priority: Critical
>         Attachments: 
> org.apache.hadoop.fs.ozone.contract.ITestOzoneContractMkdir-output.txt, 
> org.apache.hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures-output.txt,
>  org.apache.hadoop.ozone.freon.TestOzoneClientKeyGenerator-output.txt, 
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator-output.txt
>
>
> Even after the changes done in HDDS-3086, some integration tests (especially 
> in it-freon) are intermittently timing out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
