[ 
https://issues.apache.org/jira/browse/HBASE-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412951#comment-15412951
 ] 

Ted Yu commented on HBASE-16367:
--------------------------------

Excerpt from the attached master log:
{code}
2016-08-06 08:43:52,391 INFO  [1-7:20000.activeMasterManager] master.HMaster: 
Initializing Master file system
...
2016-08-06 08:43:52,489 INFO  [master/1-7.openstacklocal/0.0.0.18:20000] 
client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
2016-08-06 08:43:52,489 DEBUG [master/1-7.openstacklocal/0.0.0.18:20000] 
client.ConnectionManager$HConnectionImplementation: clusterid came back null, 
using default default-cluster
2016-08-06 08:43:52,499 DEBUG [master/1-7.openstacklocal/0.0.0.18:20000] 
ipc.AbstractRpcClient: 
Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@4a03cc69, compressor=null, 
tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, 
writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, 
fallbackAllowed=false, bind address=null
2016-08-06 08:43:52,512 INFO  [master/1-7.openstacklocal/0.0.0.18:20000] 
regionserver.HRegionServer: STOPPED: Cluster ID has not been set
{code}
HRegionServer#run() was executed before finishActiveMasterInitialization() 
reached the point where it sets the cluster Id.
The latch allows finishActiveMasterInitialization() to wake up 
HRegionServer#run() as soon as the cluster Id is published.
If the cluster Id is still not available after the wait, the region server 
process shuts down (current behavior).
Normally finishActiveMasterInitialization() should not need 50 seconds to 
publish the cluster Id.

I can add a check for await() returning false, along with a debug log, so 
that the findbugs warning is suppressed.
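
For illustration only, here is a minimal standalone sketch of the latch 
pattern described above, including the check on the await() return value. 
It uses java.util.concurrent.CountDownLatch; the class, field and method 
names are hypothetical and are not taken from the patch or from HBase, and 
the timeout is a parameter rather than the 50 seconds mentioned above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the master / region server interaction;
// none of these names come from the actual HBase code.
class ClusterIdLatchSketch {
    // Released once, when the cluster Id has been published.
    private final CountDownLatch clusterIdPublished = new CountDownLatch(1);
    private volatile String clusterId;

    // Master initialization side: set the Id, then wake up any waiter.
    void publishClusterId(String id) {
        clusterId = id;
        clusterIdPublished.countDown();
    }

    // Region server side: wait up to timeoutMillis for the Id.
    // Returns null if the Id was not published in time.
    String awaitClusterId(long timeoutMillis) {
        try {
            boolean published =
                clusterIdPublished.await(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!published) {
                // Checking the return value is what avoids the
                // "return value ignored" style of static-analysis warning.
                System.err.println("Cluster Id not published after "
                    + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
        return clusterId;
    }
}
```

In this sketch the caller would treat a null result the same way the region 
server treats a missing cluster Id today, i.e. by stopping.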


> Race between master and region server initialization may lead to premature 
> server abort
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-16367
>                 URL: https://issues.apache.org/jira/browse/HBASE-16367
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 16367.addendum, 16367.v1.txt, 16367.v2.txt, 
> 16367.v3.txt, 63908-master.log
>
>
> I was troubleshooting a case where hbase (1.1.2) master always dies shortly 
> after start - see attached master log snippet.
> It turned out that master initialization thread was racing with 
> HRegionServer#preRegistrationInitialization() (initializeZooKeeper, actually) 
> since HMaster extends HRegionServer.
> Through additional logging in master:
> {code}
>     this.oldLogDir = createInitialFileSystemLayout();
>     HFileSystem.addLocationsOrderInterceptor(conf);
>     LOG.info("creating splitLogManager");
> {code}
> I found that execution didn't reach the last log line before region server 
> declared cluster Id being null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
