[
https://issues.apache.org/jira/browse/HBASE-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795736#comment-16795736
]
Duo Zhang commented on HBASE-22060:
-----------------------------------
{quote}
It's also possible to add master timestamp to protocol, similar to RS
timestamp. RS would remember the latest master that talked to it, and include
the TS in its messages. If the master receives a message with a mismatching
timestamp it can reject a message with a special error that will cause RS to
re-read ZK and retry.
{quote}
I think this is a possible solution. We can include the master start code in
the OpenRegionProcedure/CloseRegionProcedure. And the RS will report this value
back when calling reportRegionTransition, then master could do something if it
finds out that the start code does not match.
Let me have a try.
> postOpenDeployTasks could send OPENED region transition state to the wrong
> master
> ---------------------------------------------------------------------------------
>
> Key: HBASE-22060
> URL: https://issues.apache.org/jira/browse/HBASE-22060
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Bahram Chehrazy
> Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.3.0
>
>
> As was reported in HBASE-21788, we have repeatedly seen regions getting stuck
> in OPENING after master restarts. Here is one scenario that I've observed
> recently:
>
> 1) There is a region is transit (RIT).
> 2) The active master aborts and begins shutting down.
> 3) The backup master becomes active quickly, finds the RIT, creates
> OpenRegionProcedure and send request to some server.
> 4) The server quickly opens the region and posts OPENED state transition, but
> it uses its cached master instead of the new one.
> 5) The old active master which had not completely shutdown its assignment
> manager yet, notes the OPENED state report and ignores it. Because no
> corresponding procedure can be found.
> 6) The new master waits forever for a response to its OPEN region request.
>
> This happens more often with the meta region because it's small and takes a
> few seconds to open. Below are some related logs:
> *Previous HMaster:*
> 2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING
> master <master-1>,17000,1552438242232: Shutting down HBase cluster: file
> system not available *****
> 2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: *****
> STOPPING region server '<master-1>,17000,1552438242232' *****
> 2019-03-14 13:20:54,358 WARN
> [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000]
> assignment.AssignmentManager: No matching procedure found for rit=OPEN,
> location=*************,17020,1552561955412, table=hbase:meta,
> region=1588230740 transition to OPENED
> 2019-03-14 13:20:55,707 INFO [master/<master-1>:17000]
> assignment.AssignmentManager: Stopping assignment manager
> *New HMaster logs:*
> 2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster]
> master.ActiveMasterManager: Deleting ZNode for
> /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871
> from backup master directory
> 2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster]
> master.ActiveMasterManager: Registered as active
> master=<master-2>,17000,1552438259871
> 2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator:
> Setting hbase:meta (replicaId=0) location in ZooKeeper as
> <server-1>,17020,1552536956826
> 2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor:
> Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE,
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
>
> *HServer logs:*
> 2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> regionserver.HRegion: Opened 1588230740; next sequenceid=229166
> 2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> regionserver.HRegionServer: Post open deploy tasks for
> hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)