[ 
https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964799#action_12964799
 ] 

Jonathan Gray commented on HBASE-3280:
--------------------------------------

Somewhat related to this, what happened on a cluster here is that the HRS got 
stuck in this loop trying to reconnect to the master, ignoring the 
YouAreDeadExceptions.  Once the master finished shutdown handling, it removed 
this server from the dead server list.  The RS then successfully heartbeated 
in to the master, and the master treated it as a legitimate RS (even though it 
had just finished processing a shutdown of it).

Is there a reason we should ever clear things out of the dead server list?  If 
this RS is in a network partition, it may not check back with the master for a 
long time, so shouldn't we always remember the dead serverNames (which include 
start codes)?
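
To make the question concrete, here is a minimal sketch of a dead server list 
that is never cleared and is keyed by the full server name.  The class and 
method names and the "host,port,startcode" string format are assumptions made 
for illustration; this is not the master's actual code.  Because the start 
code is part of the key, a restarted process on the same host and port would 
not be mistaken for the expired instance.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; names and key format are assumed for illustration.
final class DeadServerRegistry {
    // Server names that have been declared dead. Entries are never removed.
    private final Set<String> dead = ConcurrentHashMap.newKeySet();

    // Builds a key that includes the start code, so each process instance
    // on the same host and port gets a distinct name.
    static String serverName(String host, int port, long startCode) {
        return host + "," + port + "," + startCode;
    }

    // Records that shutdown handling has started (or finished) for this instance.
    void markDead(String serverName) {
        dead.add(serverName);
    }

    // True only if this exact instance (same start code) was declared dead.
    boolean isDead(String serverName) {
        return dead.contains(serverName);
    }
}
```

With this shape, a late heartbeat from the expired instance is still rejected 
after shutdown handling completes, while a fresh instance with a new start 
code is accepted.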

> YouAreDeadException being swallowed in HRS getMaster()
> ------------------------------------------------------
>
>                 Key: HBASE-3280
>                 URL: https://issues.apache.org/jira/browse/HBASE-3280
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>             Fix For: 0.90.0, 0.92.0
>
>
> In the HRS, when we lose our connection to the master, we enter into a loop 
> where we keep trying to get the new master location in ZK and attempt to send 
> our heartbeat.  Within tryRegionServerReport() we could get a 
> YouAreDeadException, but we won't let it out.  This leads to the RS 
> continuously heartbeating in to the master although the master keeps telling 
> it to kill itself.
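
The behaviour described above can be sketched as a retry loop that 
distinguishes a fatal "you are dead" answer from transient failures.  This is 
an illustrative sketch only: HeartbeatLoopSketch, Master, 
YouAreDeadSketchException and reportWithRetry are hypothetical names, not the 
HRS code, and the real loop also re-resolves the master location in ZK.

```java
import java.io.IOException;

// Hypothetical sketch; none of these types are the real HBase classes.
final class HeartbeatLoopSketch {
    // Stand-in for the fatal exception the master sends to an expired server.
    static class YouAreDeadSketchException extends IOException {
        YouAreDeadSketchException(String msg) { super(msg); }
    }

    // Stand-in for the heartbeat call to the master.
    interface Master {
        void report() throws IOException;
    }

    // Retries on ordinary IOExceptions but lets the fatal exception escape,
    // so the caller can abort instead of heartbeating forever.
    // Returns the attempt number that succeeded, or -1 if all attempts failed.
    static int reportWithRetry(Master master, int maxAttempts)
            throws YouAreDeadSketchException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                master.report();
                return attempt;
            } catch (YouAreDeadSketchException e) {
                throw e; // fatal: do not swallow, propagate to the caller
            } catch (IOException e) {
                // transient failure (e.g. connection lost): try again
            }
        }
        return -1;
    }
}
```

The ordering of the catch clauses matters here: the more specific fatal 
exception has to be caught (and rethrown) before the general IOException 
handler, otherwise it is swallowed along with the transient errors.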

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
