[ 
https://issues.apache.org/jira/browse/HDDS-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773798#comment-17773798
 ] 

Nandakumar commented on HDDS-8343:
----------------------------------

[~ckj], when a datanode that was offline comes back up, it still holds the old
pipelines that were deleted while it was down. It will try to start a leader
election for them, but the election will fail because the other datanodes no
longer recognize the pipeline (Ratis groupID).
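To make this failure mode concrete, here is a minimal model of a single Raft
election round. It is plain Java, not the Ratis API, and the class and enum
names are hypothetical. It assumes a 3-member group, counts the candidate's own
vote, and treats a "group not found" error or an unreachable peer the same as a
missing vote. That matches the DN3 log quoted in the issue, where each round
ends with "received 0 response(s) and 2 exception(s)".

```java
import java.util.List;

/**
 * Illustrative model of one Raft leader-election round (not Ratis code).
 * All names here are hypothetical.
 */
final class ElectionRoundSketch {

    /** What a single peer returned for the candidate's vote request. */
    enum Reply { GRANTED, DENIED, GROUP_NOT_FOUND, UNAVAILABLE }

    enum Result { PASSED, REJECTED }

    /**
     * Decides a round by simple majority.
     *
     * @param groupSize voting members in the group, including the candidate
     * @param replies   one reply per other member
     */
    static Result decide(int groupSize, List<Reply> replies) {
        int votes = 1; // the candidate votes for itself
        for (Reply reply : replies) {
            if (reply == Reply.GRANTED) {
                votes++;
            }
            // DENIED, GROUP_NOT_FOUND and UNAVAILABLE add no vote.
        }
        int majority = groupSize / 2 + 1;
        return votes >= majority ? Result.PASSED : Result.REJECTED;
    }

    public static void main(String[] args) {
        // DN3's situation: one peer unreachable, one peer no longer has the group.
        List<Reply> replies = List.of(Reply.UNAVAILABLE, Reply.GROUP_NOT_FOUND);
        System.out.println(decide(3, replies)); // REJECTED
    }
}
```

Because the group never reappears on the other datanodes, every retry ends the
same way; in the log the candidate just bumps its term (724, then 725) and falls
back to FOLLOWER. The model deliberately ignores term comparison and stepping
down on a higher-term reply, which do not change the outcome here.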

This creates a few ERROR/WARN messages in the datanode logs, but it does not
affect datanode functionality. Eventually, SCM will ask the datanode to delete
the pipeline, after which the error messages should disappear.
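The cleanup step can be pictured as a set difference: the pipelines a datanode
still hosts minus the pipelines SCM still knows about. The sketch below is a
hypothetical model, not Ozone's SCM code. It assumes the datanode's list of
hosted pipelines reaches SCM (for example through its regular reports) and only
computes which IDs would be candidates for deletion. The class name, method
name and pipeline IDs are placeholders.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical model of the stale-pipeline cleanup (not Ozone code).
 */
final class StalePipelineSketch {

    /**
     * Pipelines the datanode still hosts but SCM no longer tracks; in this
     * model these are the ones SCM would ask the datanode to delete.
     */
    static Set<String> stalePipelines(Set<String> hostedByDatanode, Set<String> knownToScm) {
        Set<String> stale = new TreeSet<>(hostedByDatanode); // sorted copy, inputs untouched
        stale.removeAll(knownToScm);
        return stale;
    }

    public static void main(String[] args) {
        // Placeholder IDs: the datanode was offline while "pipeline-A" was deleted.
        Set<String> hosted = Set.of("pipeline-A", "pipeline-B");
        Set<String> known = Set.of("pipeline-B", "pipeline-C");
        System.out.println(stalePipelines(hosted, known)); // [pipeline-A]
    }
}
```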

Data reads don't go through Ratis, so this cannot affect reads in Ozone.
If you see a read failure, it must have some other cause and is not related to
the old deleted pipelines.

Can you please share the Ozone Client and the Datanode logs related to the read 
failure?

> Failed to elect leader due to Ratis group not found
> ---------------------------------------------------
>
>                 Key: HDDS-8343
>                 URL: https://issues.apache.org/jira/browse/HDDS-8343
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Kaijie Chen
>            Assignee: Nandakumar
>            Priority: Major
>
> We encountered a problem during an upgrade
> (the problem may not be directly caused by the upgrade).
> Suppose we have 3 DataNodes forming a Ratis group.
> On DN1 and DN2, the pipeline was closed and the Ratis group has been deleted.
> On DN3, the Ratis group has not been deleted, so it keeps trying to start an
> election but fails to elect a leader.
> In this situation, we cannot read data from this pipeline.
>  
> Here are the logs on the DN3
> {code:java}
> 2023-03-16 18:39:22,527 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C: changes role from  
> FOLLOWER to CANDIDATE at term 724 for changeToCandidate
> 2023-03-16 18:39:22,527 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE on pipeline 
> PipelineID=1b0b8153-71fd-437a-b486-bbea4a4fba6c.Reason : 
> 207b98d9-ad64-45a8-940f-504b514feff5 is in candidate state for 322168ms
> 2023-03-16 18:39:22,527 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.leaderelection.pre-vote = false (custom)
> 2023-03-16 18:39:22,527 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: 207b98d9-ad64-45a8-940f-504b514feff5: 
> start 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310
> 2023-03-16 18:39:22,554 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310 
> ELECTION round 0: submit vote requests at term 725 for 0: 
> peers:[f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e|rpc:9.179.142.251:9858|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 33b49c34-caa2-4b4f-894e-dce7db4f97b9|rpc:9.180.20.222:9858|dataStream:|priority:1|startupRole:FOLLOWER,
>  
> 207b98d9-ad64-45a8-940f-504b514feff5|rpc:9.180.21.88:9858|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-16 18:39:22,554 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.min = 5s (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-16 18:39:22,554 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.max = 5200ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-16 18:39:22,554 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310 got 
> exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:22,556 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310 got 
> exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e: group-BBEA4A4FBA6C not found.
> 2023-03-16 18:39:22,556 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310: 
> ELECTION REJECTED received 0 response(s) and 2 exception(s):
> 2023-03-16 18:39:22,556 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection:   Exception 0: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:22,556 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection:   Exception 1: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e: group-BBEA4A4FBA6C not found.
> 2023-03-16 18:39:22,556 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310 
> ELECTION round 0: result REJECTED
> 2023-03-16 18:39:22,557 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.RaftServer$Division: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C: changes role from 
> CANDIDATE to FOLLOWER at term 725 for REJECTED
> 2023-03-16 18:39:22,557 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.RoleInfo: 
> 207b98d9-ad64-45a8-940f-504b514feff5: shutdown 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310
> 2023-03-16 18:39:22,557 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-LeaderElection310] 
> INFO org.apache.ratis.server.impl.RoleInfo: 
> 207b98d9-ad64-45a8-940f-504b514feff5: start 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState
> 2023-03-16 18:39:22,557 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.min = 5s (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-16 18:39:22,557 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-BBEA4A4FBA6C-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.max = 5200ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState: change 
> to CANDIDATE, lastRpcElapsedTime:5189254060ns, electionTimeout:5189ms
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: 207b98d9-ad64-45a8-940f-504b514feff5: 
> shutdown 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D: changes role from  
> FOLLOWER to CANDIDATE at term 706 for changeToCandidate
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE on pipeline 
> PipelineID=4a5ed735-c797-45e8-a8e5-c8992d1fb40d.Reason : 
> 207b98d9-ad64-45a8-940f-504b514feff5 is in candidate state for 325322ms
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.leaderelection.pre-vote = false (custom)
> 2023-03-16 18:39:25,688 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: 207b98d9-ad64-45a8-940f-504b514feff5: 
> start 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311 
> ELECTION round 0: submit vote requests at term 707 for 0: 
> peers:[1e40274c-a4bd-4e3d-8479-59f8105ec408|rpc:100.76.18.99:9858|dataStream:|priority:1|startupRole:FOLLOWER,
>  
> 207b98d9-ad64-45a8-940f-504b514feff5|rpc:9.180.21.88:9858|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> bcdf3bd5-7b8e-435d-b3fa-b3e29f0eb307|rpc:9.180.5.41:9858|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.min = 5s (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.max = 5200ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311 got 
> exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311 got 
> exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311: 
> ELECTION REJECTED received 0 response(s) and 2 exception(s):
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection:   Exception 0: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:25,723 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection:   Exception 1: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311 
> ELECTION round 0: result REJECTED
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.RaftServer$Division: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D: changes role from 
> CANDIDATE to FOLLOWER at term 707 for REJECTED
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.RoleInfo: 
> 207b98d9-ad64-45a8-940f-504b514feff5: shutdown 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-LeaderElection311] 
> INFO org.apache.ratis.server.impl.RoleInfo: 
> 207b98d9-ad64-45a8-940f-504b514feff5: start 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.min = 5s (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-16 18:39:25,724 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-C8992D1FB40D-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.rpc.first-election.timeout.max = 5200ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState: change 
> to CANDIDATE, lastRpcElapsedTime:5018766985ns, electionTimeout:5018ms
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: 207b98d9-ad64-45a8-940f-504b514feff5: 
> shutdown 207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1: changes role from  
> FOLLOWER to CANDIDATE at term 706 for changeToCandidate
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE on pipeline 
> PipelineID=5d8061d6-1692-4eb4-a604-3c37ea22a9c1.Reason : 
> 207b98d9-ad64-45a8-940f-504b514feff5 is in candidate state for 326082ms
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] INFO 
> org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.leaderelection.pre-vote = false (custom)
> 2023-03-16 18:39:26,439 
> [207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: 207b98d9-ad64-45a8-940f-504b514feff5: 
> start 
> 207b98d9-ad64-45a8-940f-504b514feff5@group-3C37EA22A9C1-LeaderElection312
> {code}
>  
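If more datanode logs are collected, a small scanner can summarize them per
Ratis group: how many election rounds ended in REJECTED and which peers answered
"group ... not found". This is a sketch in plain Java with no Ozone or Ratis
dependencies. It assumes one log entry per line, as in the raw datanode log (the
wrapped excerpt above would have to be rejoined first), and its regular
expressions are written against the exact message formats in that excerpt, so
other Ratis versions may phrase them differently. The class name is
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Summarizes failed Ratis elections in a datanode log, one entry per line. */
final class ElectionLogSummary {

    // e.g. "...@group-BBEA4A4FBA6C-LeaderElection310 ELECTION round 0: result REJECTED"
    private static final Pattern RESULT = Pattern.compile(
        "@(group-[0-9A-F]+)-LeaderElection\\d+ ELECTION round \\d+: result (\\w+)");

    // e.g. "INTERNAL: f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e: group-BBEA4A4FBA6C not found."
    private static final Pattern NOT_FOUND = Pattern.compile(
        "([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}): (group-[0-9A-F]+) not found");

    /** Group -> number of election rounds that ended in REJECTED. */
    final Map<String, Integer> rejectedRounds = new TreeMap<>();
    /** Group -> peers that reported they do not have that group. */
    final Map<String, Set<String>> peersWithoutGroup = new TreeMap<>();

    void accept(String line) {
        Matcher result = RESULT.matcher(line);
        if (result.find() && result.group(2).equals("REJECTED")) {
            rejectedRounds.merge(result.group(1), 1, Integer::sum);
        }
        Matcher notFound = NOT_FOUND.matcher(line);
        while (notFound.find()) {
            // A TreeSet de-duplicates the repeated "Exception N" lines for the same peer.
            peersWithoutGroup
                .computeIfAbsent(notFound.group(2), g -> new TreeSet<>())
                .add(notFound.group(1));
        }
    }

    static ElectionLogSummary of(List<String> lines) {
        ElectionLogSummary summary = new ElectionLogSummary();
        lines.forEach(summary::accept);
        return summary;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ElectionLogSummary <datanode-log-file>");
            return;
        }
        ElectionLogSummary summary = new ElectionLogSummary();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(summary::accept);
        }
        summary.rejectedRounds.forEach((group, count) -> System.out.println(
            group + ": " + count + " rejected round(s), peers without group: "
                + summary.peersWithoutGroup.getOrDefault(group, Set.of())));
    }
}
```

A group whose peers all show up under "peers without group" is a likely
leftover of a deleted pipeline; a group with only UNAVAILABLE errors points at
connectivity instead.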



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
