[
https://issues.apache.org/jira/browse/HBASE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nitay Joffe updated HBASE-1629:
-------------------------------
Attachment: hbase-1629.patch
Small patch for a convoluted problem. Amandeep, try this out, see if it fixes
it for you.
Here's the problem:
{noformat}
[14:32] <nitay> reportForDuty()
[14:32] <nitay> while (!getMaster()) {
[14:32] <nitay> sleeper.sleep();
[14:32] <nitay> LOG.warn("Unable to get master for initialization");
[14:32] <nitay> }
[14:33] <nitay> getMaster()
[14:33] <nitay> HServerAddress masterAddress = null;
[14:33] <nitay> while (masterAddress == null) {
[14:33] <nitay> if (stopRequested.get()) {
[14:33] <nitay> return false;
[14:33] <nitay> }
{noformat}
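This is an infinite loop: once stopRequested is set, getMaster() returns false
on every call, so the while loop in reportForDuty() never exits. That produces
the repeated warnings at the end of the RS Log Amandeep posted.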
The flow of logic that leads to this is the following:
# RS session with ZooKeeper expires.
# Master gets znode expiration, starts cleanup/shutdown of RS.
# RS gets its session expired, begins restart() logic, setting stopRequested.
# Meanwhile, RS run() thread is still talking to master.
# Master gets a message from RS but no longer recognizes it, because the RS has
already been removed. This is the "received server report from unknown
server..." stuff. Master tells the RS to reinitialize by sending
MSG_CALL_SERVER_STARTUP.
# RS on getting MSG_CALL_SERVER_STARTUP calls reportForDuty() and is now in a
loop. The restart() thread from ZooKeeper is waiting for the RS run() to
finish, but it never will.
This simple patch makes reportForDuty() fail fast when stopRequested is set.
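To illustrate the fail-fast idea, here is a minimal, self-contained sketch. It is not the attached patch and not the real HRegionServer code: the class name ReportForDutySketch, the Supplier-based master lookup, the boolean return type of reportForDuty(), and the pause() helper are all assumptions made for illustration. Only the loop shapes and the stopRequested flag come from the snippet above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical stand-in for the region server; not the real HRegionServer.
class ReportForDutySketch {
    final AtomicBoolean stopRequested = new AtomicBoolean(false);

    // Assumed stand-in for the master address lookup; returns null if unknown.
    private final Supplier<String> masterLookup;

    ReportForDutySketch(Supplier<String> masterLookup) {
        this.masterLookup = masterLookup;
    }

    // Same shape as the snippet above: gives up only when a stop is requested.
    boolean getMaster() {
        String masterAddress = null;
        while (masterAddress == null) {
            if (stopRequested.get()) {
                return false;
            }
            masterAddress = masterLookup.get();
            if (masterAddress == null) {
                pause();
            }
        }
        return true;
    }

    // Fail-fast variant: without the stopRequested check, this loop would
    // retry forever, because getMaster() keeps returning false once a stop
    // has been requested.
    boolean reportForDuty() {
        while (!getMaster()) {
            if (stopRequested.get()) {
                return false; // stop retrying so the run() thread can exit
            }
            pause();
        }
        return true;
    }

    private static void pause() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stop already requested and no master reachable: returns at once.
        ReportForDutySketch stopping = new ReportForDutySketch(() -> null);
        stopping.stopRequested.set(true);
        System.out.println(stopping.reportForDuty()); // prints false

        // Master reachable and no stop requested: reports successfully.
        ReportForDutySketch healthy =
            new ReportForDutySketch(() -> "master.example:60000");
        System.out.println(healthy.reportForDuty()); // prints true
    }
}
```

With the check in place, the run() thread can leave reportForDuty() and finish, which is what the restart() thread is waiting for.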
> HRS unable to contact master
> ----------------------------
>
> Key: HBASE-1629
> URL: https://issues.apache.org/jira/browse/HBASE-1629
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.20.0
> Reporter: Amandeep Khurana
> Assignee: Nitay Joffe
> Fix For: 0.20.0
>
> Attachments: hbase-1629.patch, Master_log, RS_Log
>
>
> HRS unable to contact master for initialization after expiration from ZK.
> Master thinks HRS is still up whereas HRS went down and now cannot restart.
> The RS logs have a flurry of the following warning messages:
> 2009-07-08 12:53:19,547 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to get master for
> initialization
> More logs from the RS and the Master attached.