[ 
https://issues.apache.org/jira/browse/KUDU-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838197#comment-15838197
 ] 

Dinesh Bhat commented on KUDU-1847:
-----------------------------------

[~bruceSz], were there only 3 nodes in the cluster? That would explain why the 
master couldn't add a 3rd replica when the config went from 3->2. If so, you 
should find a message like "No candidate replacement replica found" in the 
master logs.
As for the failed replica never being kicked out of the Raft config, that is 
KUDU-1407. It is a work in progress at the moment, and we hope to fix it in 1.3.
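
For illustration, here is the quorum arithmetic behind the stuck election in 
the quoted log. It is a minimal Python sketch of the general Raft majority 
rule, not Kudu code. The function names are hypothetical.

```python
def majority_size(num_voters: int) -> int:
    """Votes needed to win a Raft election among num_voters voters."""
    return num_voters // 2 + 1


def can_elect_leader(num_voters: int, num_healthy: int) -> bool:
    """True if the healthy voters alone can form a majority."""
    return num_healthy >= majority_size(num_voters)


# Config from the quoted log: 2 VOTER peers, one of them FAILED.
print(can_elect_leader(num_voters=2, num_healthy=1))  # False

# A full 3-voter config tolerates one failed replica.
print(can_elect_leader(num_voters=3, num_healthy=2))  # True
```

With the failed peer still counted as a voter, the one healthy replica can 
never collect the 2 votes it needs, so the tablet stays without a leader.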

> kudu-tserver should remove itself from raft-peer-config when met tablet 
> corruption
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-1847
>                 URL: https://issues.apache.org/jira/browse/KUDU-1847
>             Project: Kudu
>          Issue Type: Bug
>          Components: cfile, consensus
>    Affects Versions: 1.0.0
>            Reporter: zhangsong
>            Priority: Critical
>             Fix For: 1.0.0
>
>
> Problem found:
> Today one of my tables became unwritable. From kudu-master, I found that 
> only one "FOLLOWER" was left in the Raft config of a tablet. 
> After searching kudu-tserver.LOG, I found error logs like this:
> "I0124 03:29:16.000665 17144 raft_consensus.cc:380] T 
> 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [term 
> 173317 FOLLOWER]: Starting election with config: opid_index: 572616 local: 
> false peers { permanent_uuid: "69947ffe22e245afb579287073c58dc2" member_type: 
> VOTER last_known_addr { host: "peer_ip" port: 7050 } } peers { 
> permanent_uuid: "1fa77467172b4ed7ba1a0a10e3dd67f8" member_type: VOTER 
> last_known_addr { host: "localhost" port: 7050 } }
> I0124 03:29:16.001211 17144 leader_election.cc:223] T 
> 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 
> [CANDIDATE]: Term 173317 election: Requesting vote from peer 
> 69947ffe22e245afb579287073c58dc2
> W0124 03:29:16.001549 15548 leader_election.cc:281] T 
> 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 
> [CANDIDATE]: Term 173317 election: Tablet error from VoteRequest() call to 
> peer 69947ffe22e245afb579287073c58dc2: Illegal state: Tablet not RUNNING: 
> FAILED: Not found: Can't find block: 0000000318394411
> I0124 03:29:16.001845 15548 leader_election.cc:248] T 
> 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 
> [CANDIDATE]: Term 173317 election: Election decided. Result: candidate lost.
> "
> These logs indicate that the current follower (f1) of the tablet started a 
> leader election (after an election timeout) and found that the tablet on the 
> other follower (f2) was not running (corrupted), so the election failed. 
> In the end, only one follower of the tablet is alive.
> I also found that the tablet on f2 had been corrupted for several days.
> Hence I think this is a bug: we lack logic to remove a peer from the 
> RaftConfig when that peer's tablet data is corrupted.



