[
https://issues.apache.org/jira/browse/HBASE-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797421#comment-16797421
]
Bahram Chehrazy commented on HBASE-22060:
-----------------------------------------
{quote}After a deep thought, I think the only way is to introduce a fence when
becoming active master, where we will tell all the existing RSes that a new
master is coming and do not accept or respond to the old master any more,
before doing any operations.
{quote}
Good idea. It looks like such a mechanism already exists via the
masterAddressTracker. The only missing part is that rssStub does not get
refreshed by that tracker. The only time it gets refreshed is when a request
to the master fails with a ServiceException.
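
To illustrate the gap, here is a minimal, self-contained sketch. It is not
HBase code: AddressTracker and ReportingClient are hypothetical stand-ins for
masterAddressTracker and the cached rssStub, and the method names are made up
for illustration. It only shows the difference between reusing the cached
master until a call fails and checking the tracked address before each report.

```java
import java.util.concurrent.atomic.AtomicReference;

class MasterStubRefreshSketch {

    /** Hypothetical stand-in for the tracker that follows the active master address. */
    static final class AddressTracker {
        private final AtomicReference<String> active = new AtomicReference<>();

        void setActiveMaster(String serverName) {
            active.set(serverName);
        }

        String getActiveMaster() {
            return active.get();
        }
    }

    /** Hypothetical stand-in for the region server side that caches a master stub. */
    static final class ReportingClient {
        private final AddressTracker tracker;
        private String cachedMaster;

        ReportingClient(AddressTracker tracker) {
            this.tracker = tracker;
            this.cachedMaster = tracker.getActiveMaster();
        }

        /** Reuses the cached master; nothing here consults the tracker. */
        String targetWithoutRefresh() {
            return cachedMaster;
        }

        /** Assumed alternative: compare with the tracker before every report. */
        String targetWithRefresh() {
            String current = tracker.getActiveMaster();
            if (current != null && !current.equals(cachedMaster)) {
                cachedMaster = current; // drop the stale stub, point at the new master
            }
            return cachedMaster;
        }
    }

    public static void main(String[] args) {
        AddressTracker tracker = new AddressTracker();
        tracker.setActiveMaster("master-1,17000,1552438242232");
        ReportingClient client = new ReportingClient(tracker);

        // Failover: the backup master registers as active.
        tracker.setActiveMaster("master-2,17000,1552438259871");

        System.out.println("without refresh: " + client.targetWithoutRefresh());
        System.out.println("with refresh:    " + client.targetWithRefresh());
    }
}
```

In this sketch the report without a refresh still targets master-1 after the
failover, which is the situation described in the issue. Whether the real fix
should refresh on every report or from a tracker callback is a separate
design question.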
> postOpenDeployTasks could send OPENED region transition state to the wrong
> master
> ---------------------------------------------------------------------------------
>
> Key: HBASE-22060
> URL: https://issues.apache.org/jira/browse/HBASE-22060
> Project: HBase
> Issue Type: Bug
> Components: amv2, proc-v2
> Affects Versions: 3.0.0
> Reporter: Bahram Chehrazy
> Assignee: Duo Zhang
> Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.3.0
>
>
> As was reported in HBASE-21788, we have repeatedly seen regions getting stuck
> in OPENING after master restarts. Here is one scenario that I've observed
> recently:
>
> 1) There is a region in transition (RIT).
> 2) The active master aborts and begins shutting down.
> 3) The backup master becomes active quickly, finds the RIT, creates an
> OpenRegionProcedure and sends the request to some server.
> 4) The server quickly opens the region and posts OPENED state transition, but
> it uses its cached master instead of the new one.
> 5) The old active master, which had not completely shut down its assignment
> manager yet, notes the OPENED state report and ignores it because no
> corresponding procedure can be found.
> 6) The new master waits forever for a response to its OPEN region request.
>
> This happens more often with the meta region because it's small and takes a
> few seconds to open. Below are some related logs:
> *Previous HMaster:*
> 2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING
> master <master-1>,17000,1552438242232: Shutting down HBase cluster: file
> system not available *****
> 2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: *****
> STOPPING region server '<master-1>,17000,1552438242232' *****
> 2019-03-14 13:20:54,358 WARN
> [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000]
> assignment.AssignmentManager: No matching procedure found for rit=OPEN,
> location=*************,17020,1552561955412, table=hbase:meta,
> region=1588230740 transition to OPENED
> 2019-03-14 13:20:55,707 INFO [master/<master-1>:17000]
> assignment.AssignmentManager: Stopping assignment manager
> *New HMaster logs:*
> 2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster]
> master.ActiveMasterManager: Deleting ZNode for
> /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871
> from backup master directory
> 2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster]
> master.ActiveMasterManager: Registered as active
> master=<master-2>,17000,1552438259871
> 2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator:
> Setting hbase:meta (replicaId=0) location in ZooKeeper as
> <server-1>,17020,1552536956826
> 2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor:
> Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE,
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
>
> *HServer logs:*
> 2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> regionserver.HRegion: Opened 1588230740; next sequenceid=229166
> 2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> regionserver.HRegionServer: Post open deploy tasks for
> hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0]
> handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740
>
>