[jira] Updated: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Jean-Daniel Cryans (JIRA) Thu, 17 Mar 2011 15:46:55 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jean-Daniel Cryans updated HBASE-3545:
--------------------------------------

    Fix Version/s: 0.90.2

> Possible liveness issue with MasterServerAddress in HRegionServer getMaster 
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-3545
>                 URL: https://issues.apache.org/jira/browse/HBASE-3545
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>         Environment: 4 Node test cluster
> 2x Hbase master
> 3x Zookeeper nodes
> 4x RS
>            Reporter: Greg Bowyer
>             Fix For: 0.90.2
>
>         Attachments: 
> 0001-Fixed-issue-where-the-region-servers-may-never-conne.patch
>
>
> As part of our evaluation of HBase we have been testing failure scenarios to 
> see how HBase fails in certain situations.
> One of these is the outright failure of a HBase master.
> What presently happens, if a HBase master is shutdown, is that the standby 
> master becomes the active master in the Zookeeper. At the same time the 
> region servers fail to connect to the dead master and typically fail their 
> own heartbeats as part of the reportForDuty() method.
> Following this the region server attempts to get a connection to a working 
> HBase master, inside the getMaster() method the first action is to get the 
> address of a potentially working master server from zookeeper. Following this 
> the code is put into a tight loop whereupon it keeps attempting to connect to 
> the address of the master found in Zookeeper.
> Unfortunately it appears that during master fail-over, it becomes possible to 
> get the address of the old, broken master, this address is then put into the 
> connection attempt loop, whereupon the region server attempts to infinitely 
> connect to the failed, none existent master. At this point nothing is able to 
> break the loop in getMaster so the RS is unable to contact the master.
> At the same time the new master is waiting patiently for the existing region 
> servers as reported in Zookeeper to re-establish contact with it.
> Attached is a patch that rectifies this issue in our test cluster for both 
> the 0.90.0 tag and trunk versions (as of git SHA 
> b72a24f71b67598e4077a9d1452f903082b0a9b7) of HBase.
> This patch is also available in a forked repository here 
> https://github.com/GregBowyer/hbase/commit/543f5903731ef6bbfd58c990e04a2c635e5c94b4

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HBASE-3545) Possible liveness issue with MasterServerAddress in HRegionServer getMaster

Reply via email to