[ https://issues.apache.org/jira/browse/HBASE-28690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-28690.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.1
                   2.5.11
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~umesh9414] for contributing!

> Aborting Active HMaster is not rejecting reportRegionStateTransition if
> procedure is initialised by next Active master
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28690
>                 URL: https://issues.apache.org/jira/browse/HBASE-28690
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.8
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.11
>
>
> A CloseRegionProcedure on master requests the RS to close the region and
> after closing the region RS reports RegionStateTransition back
> ([here|https://github.com/apache/hbase/blob/d1015a68ed9f94d74668abd37edefd32f5e9305b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L1853]).
> On receiving the report, the master checks if regionNode has any procedure
> assigned to it
> ([code|https://github.com/apache/hbase/blob/d1015a68ed9f94d74668abd37edefd32f5e9305b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1294]).
>
>
> {code:java}
> private boolean reportTransition(RegionStateNode regionNode, ServerStateNode serverNode,
>   TransitionCode state, long seqId, long procId) throws IOException {
>   ServerName serverName = serverNode.getServerName();
>   TransitRegionStateProcedure proc = regionNode.getProcedure();
>   if (proc == null) {
>     return false;
>   }
>   proc.reportTransition(master.getMasterProcedureExecutor().getEnvironment(), regionNode,
>     serverName, state, seqId, procId);
>   return true;
> } {code}
> If regionNode doesn't have any procedure, the master just logs it and doesn't
> throw any error to RPC.
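The check quoted above, and the rejection the issue title asks for, can be illustrated with a small self-contained toy model. This is only a sketch under stated assumptions, not the patch that was committed: `StaleReportDemo`, `ToyMaster`, the `aborting` flag, and both `reportTransition*` methods are hypothetical names that do not exist in HBase, and the region name and procedure id are placeholders. The lenient method mirrors the quoted behaviour (no procedure means `false` and nothing is thrown), while the strict method shows one possible way an aborting master could surface an error to the caller instead.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are HBase classes.
class StaleReportDemo {

  /** Hypothetical stand-in for one master's in-memory view of region procedures. */
  static class ToyMaster {
    // Region name -> id of the procedure this master attached to the region.
    private final Map<String, Long> procedureByRegion = new HashMap<>();
    // Assumption: a flag the RPC path can consult once the master starts aborting.
    private boolean aborting;

    void attachProcedure(String region, long procId) {
      procedureByRegion.put(region, procId);
    }

    void abort() {
      aborting = true;
    }

    /** Mirrors the quoted check: no procedure means "return false", nothing is thrown. */
    boolean reportTransitionLenient(String region) {
      return procedureByRegion.containsKey(region);
    }

    /** Alternative: an aborting master refuses a report it cannot match. */
    boolean reportTransitionStrict(String region) throws IOException {
      if (procedureByRegion.containsKey(region)) {
        return true;
      }
      if (aborting) {
        throw new IOException("master is aborting; report must go to the active master");
      }
      return false;
    }
  }

  /** Maps a report attempt to "accepted", "ignored" (only logged), or "rejected". */
  static String outcome(ToyMaster master, String region, boolean strict) {
    try {
      boolean matched = strict
        ? master.reportTransitionStrict(region)
        : master.reportTransitionLenient(region);
      return matched ? "accepted" : "ignored";
    } catch (IOException e) {
      return "rejected";
    }
  }

  public static void main(String[] args) {
    String region = "region-A"; // placeholder region name

    // Old master: aborting, and it never saw the procedure created by its successor.
    ToyMaster oldMaster = new ToyMaster();
    oldMaster.abort();

    // New master: owns the procedure for the region.
    ToyMaster newMaster = new ToyMaster();
    newMaster.attachProcedure(region, 1L);

    System.out.println(outcome(oldMaster, region, false)); // prints "ignored"
    System.out.println(outcome(oldMaster, region, true));  // prints "rejected"
    System.out.println(outcome(newMaster, region, true));  // prints "accepted"
  }
}
```

In the lenient case the caller sees a normal return and has no reason to retry, which is how a report can be lost; in the strict case the caller gets an error it can react to, for example by retrying against the master that owns the procedure.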
>
> Think of a case when MasterFailover is happening and the new Active master
> only initialized the TRSP and CloseRegionProcedure. Now aborting Master has
> stale/false data. If the transition report comes to the aborting master, not
> rejecting this report is causing the procedure to get stuck.
>
> *Logs for more understanding*
> active master server4-1 failing
> {noformat}
> 2024-06-20 04:45:05,576 ERROR [iority.RWQ.Fifo.write.handler=3,queue=0,port=61000] master.HMaster - ***** ABORTING master server4-1,61000,1715413775736: Failed to record region server as started *****{noformat}
> *logs of new active master server5-1*
>
> {noformat}
> 2024-06-20 04:49:28,893 DEBUG [aster/server5-1:61000:becomeActiveMaster] assignment.RegionStateStore - Load hbase:meta entry region=888a715d5926adbb89c985d8967f40d4, regionState=OPEN, lastHost=server1-119,61020,1717560166420, regionLocation=server1-119,61020,1717560166420, openSeqNum=34892620
> 2024-06-20 04:49:51,886 INFO [PEWorker-22] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=16276416, ppid=16276108, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=RIMBS.UPLOADER_JOB_DETAILS, region=888a715d5926adbb89c985d8967f40d4, UNASSIGN}] (on server5-1)
> 2024-06-20 04:49:52,022 INFO [PEWorker-40] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=16276470, ppid=16276416, state=RUNNABLE; CloseRegionProcedure 888a715d5926adbb89c985d8967f40d4, server=server1-119,61020,1717560166420}] (on server5-1){noformat}
>
> *RS logs for closing*
> {noformat}
> 2024-06-20 04:49:52,267 INFO [_REGION-regionserver/server1-119:61020-2] handler.UnassignRegionHandler - Close 888a715d5926adbb89c985d8967f40d4
> 2024-06-20 04:49:52,267 DEBUG [_REGION-regionserver/server1-119:61020-2] regionserver.HRegion - Closing 888a715d5926adbb89c985d8967f40d4, disabling compactions & flushes
> 2024-06-20 04:49:52,354 INFO [_REGION-regionserver/server1-119:61020-2] regionserver.HRegion - Closed TABLE,KW\x00na240-app1-16\x00/Events-120620231740\x00MARKER-Events,1702619592612.888a715d5926adbb89c985d8967f40d4.
> {noformat}
> *Logs of report on aborting active Hmaster*
> {noformat}
> 2024-06-20 04:49:52,355 WARN [iority.RWQ.Fifo.write.handler=1,queue=0,port=61000] assignment.AssignmentManager - No matching procedure found for server1-119,61020,1717560166420 transition on state=OPEN, location=server1-119,61020,1717560166420, table=RIMBS.UPLOADER_JOB_DETAILS, region=888a715d5926adbb89c985d8967f40d4 to CLOSED ( host = server4-1 , hbaseMasterLogFile){noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)