[ 
https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996152#comment-12996152
 ] 

Hudson commented on HBASE-3545:
-------------------------------

Integrated in HBase-TRUNK #1746 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1746/])
    HBASE-3545 Possible liveness issue with MasterServerAddress in 
HRegionServer getMaster


> Possible liveness issue with MasterServerAddress in HRegionServer getMaster 
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-3545
>                 URL: https://issues.apache.org/jira/browse/HBASE-3545
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>         Environment: 4 Node test cluster
> 2x Hbase master
> 3x Zookeeper nodes
> 4x RS
>            Reporter: Greg Bowyer
>         Attachments: 
> 0001-Fixed-issue-where-the-region-servers-may-never-conne.patch
>
>
> As part of our evaluation of HBase we have been testing failure scenarios to 
> see how HBase fails in certain situations.
> One of these is the outright failure of an HBase master.
> What presently happens, if an HBase master is shut down, is that the standby 
> master becomes the active master in ZooKeeper. At the same time the 
> region servers fail to connect to the dead master and typically fail their 
> own heartbeats as part of the reportForDuty() method.
> Following this, the region server attempts to get a connection to a working 
> HBase master. Inside the getMaster() method, the first action is to get the 
> address of a potentially working master server from ZooKeeper. Following 
> this, the code enters a tight loop in which it keeps attempting to connect 
> to the address of the master found in ZooKeeper.
> Unfortunately, it appears that during master fail-over it is possible to 
> get the address of the old, broken master. This address is then put into the 
> connection attempt loop, whereupon the region server tries indefinitely to 
> connect to the failed, nonexistent master. At this point nothing is able to 
> break the loop in getMaster(), so the RS is unable to contact the master.
> At the same time, the new master is waiting patiently for the existing region 
> servers, as reported in ZooKeeper, to re-establish contact with it.
> Attached is a patch that rectifies this issue in our test cluster for both 
> the 0.90.0 tag and trunk versions (as of git SHA 
> b72a24f71b67598e4077a9d1452f903082b0a9b7) of HBase.
> This patch is also available in a forked repository here 
> https://github.com/GregBowyer/hbase/commit/543f5903731ef6bbfd58c990e04a2c635e5c94b4
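
The liveness problem described in the report above (an address read once 
and then retried forever) can be illustrated with a small, self-contained 
sketch. This is not the attached patch and not HBase code: the class 
MasterLookupSketch, the method findLiveMaster, and the addressLookup and 
tryConnect parameters are hypothetical names standing in for the ZooKeeper 
read and the connection attempt. The sketch assumes that one way to avoid 
the hang is to re-read the address on every attempt and to bound the number 
of attempts.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class MasterLookupSketch {

    /**
     * Hypothetical retry loop. The address is looked up again on every
     * iteration, so a stale address seen during fail-over is replaced on
     * the next attempt. Returns the first address that accepts a
     * connection, or null if maxAttempts is exhausted or the thread is
     * interrupted while waiting.
     */
    public static String findLiveMaster(Supplier<String> addressLookup,
                                        Predicate<String> tryConnect,
                                        int maxAttempts,
                                        long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Fresh read each time (stands in for the ZooKeeper lookup).
            String address = addressLookup.get();
            if (address != null && tryConnect.test(address)) {
                return address;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated lookups: the stale address twice, then the new master.
        Iterator<String> reads = List.of(
                "old-master:60000", "old-master:60000", "new-master:60000")
                .iterator();
        String found = findLiveMaster(
                reads::next,
                addr -> addr.startsWith("new-master"), // only the new master answers
                5,
                10L);
        System.out.println(found);
    }
}
```

If the lookup were instead hoisted above the loop, the simulated region 
server would keep retrying "old-master:60000" and never see the new 
address, which is the behaviour the report describes.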

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
