[ 
https://issues.apache.org/jira/browse/HBASE-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795565#comment-16795565
 ] 

Sergey Shelukhin commented on HBASE-22060:
------------------------------------------

I think the long term fix should be getting rid of imperative protocol; master 
already receives heartbeats and should always derive region state from those... 
that would also make other aspects more resilient to issues and prevent stuck 
clusters.

Short term killing RS should be ok.
One more thing is to make sure master doesn't delete znode/relinquish primary 
status until the end of shutdown. Master being able to process messages or 
changes state at all after losing primary is a major issue...

It's also possible to add master timestamp to protocol, similar to RS 
timestamp. RS would remember the latest master that talked to it, and include 
the TS in its messages. If the master receives a message with a mismatching 
timestamp it can reject a message with a special error that will cause RS to 
re-read ZK and retry. Although with the possibility of two conflicting masters 
this can still be an issue, unless the ts is globally reliable, e.g. is a 
ZK-based incremented value.

> postOpenDeployTasks could send OPENED region transition state to the wrong 
> master
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-22060
>                 URL: https://issues.apache.org/jira/browse/HBASE-22060
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: stack
>            Priority: Critical
>
> As was reported in HBASE-21788, we have repeatedly seen regions getting stuck 
> in OPENING after master restarts. Here is one scenario that I've observed 
> recently:
>  
> 1) There is a region is transit (RIT).
> 2) The active master aborts and begins shutting down.
> 3) The backup master becomes active quickly, finds the RIT, creates 
> OpenRegionProcedure and send request to some server.
> 4) The server quickly opens the region and posts OPENED state transition, but 
> it uses its cached master instead of the new one. 
> 5) The old active master which had not completely shutdown its assignment 
> manager yet, notes the OPENED state report and ignores it. Because no 
> corresponding procedure can be found.
> 6) The new master waits forever for a response to its OPEN region request.
>  
> This happens more often with the meta region because it's small and takes a 
> few seconds to open. Below are some related logs:
> *Previous HMaster:*
> 2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING 
> master <master-1>,17000,1552438242232: Shutting down HBase cluster: file 
> system not available *****
> 2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: ***** 
> STOPPING region server '<master-1>,17000,1552438242232' *****
> 2019-03-14 13:20:54,358 WARN 
> [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000] 
> assignment.AssignmentManager: No matching procedure found for rit=OPEN, 
> location=*************,17020,1552561955412, table=hbase:meta, 
> region=1588230740 transition to OPENED
> 2019-03-14 13:20:55,707 INFO [master/<master-1>:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> *New HMaster logs:*
> 2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster] 
> master.ActiveMasterManager: Deleting ZNode for 
> /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871
>  from backup master directory
> 2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster] 
> master.ActiveMasterManager: Registered as active 
> master=<master-2>,17000,1552438259871
> 2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> <server-1>,17020,1552536956826
> 2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
>  
> *HServer logs:*
> 2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] 
> handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] 
> regionserver.HRegion: Opened 1588230740; next sequenceid=229166
> 2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] 
> regionserver.HRegionServer: Post open deploy tasks for 
> hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] 
> handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to