[ 
https://issues.apache.org/jira/browse/HBASE-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991456#comment-14991456
 ] 

Samir Ahmic commented on HBASE-14664:
-------------------------------------

Thanks for the review, [~stack].
Regarding the unit test, what do you have in mind?
ActiveMasterManager#handleMasterNodeChange() is covered in 
TestActiveMasterManager and is also exercised in TestMasterFailover as part of 
the failover process.
Should we create a test that covers this scenario (killing and restarting the 
master to reproduce the issue), or focus on the aftermath of removing the 
meta-region-server znode?


> Master failover issue: Backup master is unable to start if active master is 
> killed and started in short time interval
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14664
>                 URL: https://issues.apache.org/jira/browse/HBASE-14664
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14664.patch, HBASE-14664.patch
>
>
> I noticed this issue while running IntegrationTestDDLMasterFailover. It can be 
> reproduced simply by executing the following on the active master (tested on a 
> cluster with two masters and three region servers):
> {code}
> $ kill -9 master_pid; hbase-daemon.sh  start master
> {code} 
> The logs show that the new active master is trying to locate the hbase:meta 
> table on the restarted, previously active master:
> {code}
> 2015-10-21 19:28:20,804 INFO  [hnode2:16000.activeMasterManager] 
> zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at 
> address=hnode1,16000,1445447051681, 
> exception=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is 
> not running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1330)
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getRegionInfo(MasterRpcServices.java:1525)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22233)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2136)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:106)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-10-21 19:28:20,805 INFO  [hnode2:16000.activeMasterManager] 
> master.HMaster: Meta was in transition on hnode1,16000,1445447051681
> 2015-10-21 19:28:20,805 INFO  [hnode2:16000.activeMasterManager] 
> master.AssignmentManager: Processing {1588230740 state=OPEN, 
> ts=1445448500598, server=hnode1,16000,1445447051681
> {code}
> Because of this, the master is unable to read the hbase:meta table:
> {code}
> 2015-10-21 19:28:49,429 INFO  [hconnection-0x6e9cebcc-shared--pool6-t1] 
> client.AsyncProcess: #2, table=hbase:meta, attempt=10/351 failed=1ops, last 
> exception: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2083)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32462)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2136)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:106)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> As a result, the master is unable to complete its startup.
> I have also noticed that in this case the value of the /hbase/meta-region-server 
> znode always points to the restarted active master (hnode1 in my cluster).
> I was able to work around this issue by repeating the same scenario with the 
> following:
> {code}
> $ kill -9 master_pid; hbase zkcli rmr /hbase/meta-region-server; 
> hbase-daemon.sh start master
> {code}
> So the issue is probably caused by a stale value in the 
> /hbase/meta-region-server znode. I will try to create a patch based on the above.
>  



