[
https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144036#comment-13144036
]
Ted Yu commented on HBASE-4749:
-------------------------------
Thanks for the finding, Jinchao.
From the log of build 105:
{code}
Killing RS juno.apache.org,60001,1320357166142
2011-11-03 21:52:56,007 FATAL [Thread-986] regionserver.HRegionServer(1523): ABORTING region server juno.apache.org,60001,1320357166142: Killing for unit test
...
2011-11-03 21:52:56,011 WARN [Thread-986] regionserver.HRegionServer(1545): Unable to report fatal error to master
java.lang.reflect.UndeclaredThrowableException
    at $Proxy16.reportRSFatalError(Unknown Source)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1541)
...
2011-11-03 21:52:57,356 INFO [Master:0;juno.apache.org,51313,1320357176029] master.HMaster(464): Registering server found up in zk: juno.apache.org,60001,1320357166142
2011-11-03 21:52:57,357 INFO [Master:0;juno.apache.org,51313,1320357176029] master.ServerManager(239): Registering server=juno.apache.org,60001,1320357166142
...
2011-11-03 21:52:57,586 INFO [Thread-986-EventThread] zookeeper.RegionServerTracker(93): RegionServer ephemeral node deleted, processing expiration [juno.apache.org,60001,1320357166142]
2011-11-03 21:52:57,588 INFO [RegionServer:1;juno.apache.org,60001,1320357166142] regionserver.HRegionServer(744): stopping server juno.apache.org,60001,1320357166142; zookeeper connection closed.
{code}
We can see that there was a 570ms delay in the completion of the region server
shutdown handler. That is why the dead region server was re-registered: the new
master still found it up in zk and registered it before its ephemeral node was
deleted.
Since reportRSFatalError() hit an exception, we cannot rely on this callback to
reach the master.
We have two options:
1. Devise a mechanism to tell the new master the identity of the dead region
server.
2. Insert a sleep of, say, 1 second before starting the new master.
Option 1 introduces extra complexity into the master. I am not sure it is worth
it just for test purposes.
Many people would not like option 2.
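A possible variant of option 2, sketched here only as an assumption: rather than a fixed sleep, the test could poll with a timeout until the killed region server is really gone (for example, until its ephemeral node has been deleted) and only then start the new master. The names WaitUtil, waitFor and znodeGone below are hypothetical and not HBase APIs; the boolean flag merely stands in for whatever check the test would actually use.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper (not part of HBase): bounded polling instead of a fixed sleep. */
final class WaitUtil {
    private WaitUtil() {}

    /**
     * Polls the condition every intervalMs until it holds or timeoutMs elapses.
     * Returns true if the condition held before the deadline, false otherwise.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: this flag stands in for "the dead RS's ephemeral node is deleted".
        AtomicBoolean znodeGone = new AtomicBoolean(false);

        // Simulate the shutdown completing some time after the kill.
        Thread shutdown = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            znodeGone.set(true);
        });
        shutdown.start();

        // Wait at most 5 seconds, checking every 20 ms, before "starting the new master".
        boolean gone = waitFor(5000, 20, znodeGone::get);
        shutdown.join();
        System.out.println("safe to start new master: " + gone);
    }
}
```

Compared with a fixed 1 second sleep, this returns as soon as the condition holds and fails visibly if it never does, but it still only helps the test; it does not address option 1.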
More discussion is welcome.
> TestMasterFailover case occasional fails
> ----------------------------------------
>
> Key: HBASE-4749
> URL: https://issues.apache.org/jira/browse/HBASE-4749
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 0.92.0
> Reporter: gaojinchao
> Priority: Minor
> Fix For: 0.92.0
>
>
> See these logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/