[
https://issues.apache.org/jira/browse/HBASE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956019#comment-17956019
]
chaijunjie commented on HBASE-29364:
------------------------------------
Did you test with a newer version, or try backporting
https://issues.apache.org/jira/browse/HBASE-28180 and testing again?
> Region will be opened in unknown regionserver when master is changed & rs
> crashed
> ---------------------------------------------------------------------------------
>
> Key: HBASE-29364
> URL: https://issues.apache.org/jira/browse/HBASE-29364
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.3.0
> Reporter: Zhiwen Deng
> Priority: Major
>
> We have encountered multiple cases where regions were opened on RegionServers
> (RS) that had already been offlined. Only recently did we discover a
> potential cause for this issue; the details are as follows:
> Our HDFS storage reached its capacity limit, which left the master and RS
> unable to write, so they aborted. After we manually deleted some data, HDFS
> recovered. The hbck report then showed that some regions were open on the
> offline RS, leaving those regions unable to serve. We finally used hbck2 to
> assign these regions, and the problem was solved.
> h3. The Problem
> Here is the analysis of the region transition for one specific region:
> 19f709990ad65ce3d51ddeaf29acf436:
> 2025-05-21, 05:48:11 : The region was assigned to
> {+}rs-hostname,20700,1747777624803{+}, but due to some anomalies, it could
> not be opened on the target RS. Finally, the RS reported the open result to
> the Master:
> {code:java}
> 2025-05-21,05:48:11,646 INFO
> [RpcServer.priority.RWQ.Fifo.write.handler=2,queue=0,port=20600]
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase: Received
> report from rs-hostname,20700,1747777624803, transitionCode=FAILED_OPEN,
> seqId=-1, regionNode=state=OPENING, location=rs-hostname,20700,1747777624803,
> table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, proc=pid=78499,
> ppid=78034, state=RUNNABLE;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure {code}
> 2025-05-21, 05:48:27 : rs-hostname,20700,1747777624803 went offline.
> {code:java}
> 2025-05-21, 05:48:27,981 INFO [KeepAlivePEWorker-65]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=78671,
> state=SUCCESS; ServerCrashProcedure server=rs-hostname,20700,1747777624803,
> splitWal=true, meta=false in 16.0720 sec {code}
> 2025-05-21, 05:49:12,312 : The RS hosting the hbase:meta table also
> encountered a failure, making the meta table unavailable, which caused the
> above region to get stuck in the RIT (Region-In-Transition) process.
> {code:java}
> 2025-05-21, 05:49:12,312 WARN [ProcExecTimeout]
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition state=OPENING, location=rs-hostname,20700,1747777624803,
> table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 {code}
> Due to the HDFS failure, the Master also performed an abort action. The new
> active Master continued to execute the previously incomplete procedure.
> {code:java}
> 2025-05-21, 06:02:38,423 INFO
> [master/master-hostname:20600:becomeActiveMaster]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=78034, ppid=77973,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=test:xxx,
> region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN
> 2025-05-21, 06:02:38,572 INFO [master/master-hostname:becomeActiveMaster]
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach
> pid=78034, ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=test:xxx,
> region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN to state=OFFLINE,
> location=null, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 to
> restore RIT {code}
> When the Master fails over, the new active Master updates the region's state
> in the meta table based on the procedure's state:
> {code:java}
> 2025-05-21, 06:07:52,433 INFO
> [master/master-hostname:20600:becomeActiveMaster]
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta
> entry region=19f709990ad65ce3d51ddeaf29acf436, regionState=OPENING,
> lastHost=rs-hostname-last,20700,1747776391310,
> regionLocation=rs-hostname,20700,1747777624803, openSeqNum=174628702
> 2025-05-21, 06:07:52,433 WARN
> [master/master-hostname:20600:becomeActiveMaster]
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure: Received
> report OPENED transition from rs-hostname,20700,1747777624803 for
> state=OPENING, location=rs-hostname,20700,1747777624803, table=test:xxx,
> region=19f709990ad65ce3d51ddeaf29acf436, pid=78499 but the new openSeqNum -1
> is less than the current one 174628702, ignoring...
> {code}
> I reviewed the relevant code and found that at this point the region's state
> in the Master's memory is changed to OPEN, and as part of restoring the state
> of RegionRemoteProcedureBase, it is persisted to the meta table.
> {code:java}
> void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
>   if (state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED) {
>     try {
>       restoreSucceedState(am, regionNode, seqId);
>     } catch (IOException e) {
>       // should not happen as we are just restoring the state
>       throw new AssertionError(e);
>     }
>   }
> }
>
> @Override
> protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
>     long openSeqNum) throws IOException {
>   if (regionNode.getState() == State.OPEN) {
>     // should have already been persisted, ignore
>     return;
>   }
>   regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED,
>     openSeqNum);
> }{code}
> Therefore, a region whose open had actually failed was persisted to the meta
> table as OPEN, and because the ServerCrashProcedure (SCP) for that RS had
> already completed, the region would not be processed again.
> {code:java}
> 2025-05-21, 06:07:53,138 INFO [PEWorker-56]
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=78034
> updating hbase:meta row=19f709990ad65ce3d51ddeaf29acf436, regionState=OPEN,
> repBarrier=174628702, openSeqNum=174628702,
> regionLocation=rs-hostname,20700,1747777624803 {code}
> h3. How to fix
> We can follow the logic used when the Master does not fail over: if the open
> fails, the region's state is not modified, which prevents the process above
> from occurring.
> {code:java}
> @Override
> protected void updateTransitionWithoutPersistingToMeta(MasterProcedureEnv env,
>     RegionStateNode regionNode, TransitionCode transitionCode, long openSeqNum)
>     throws IOException {
>   if (transitionCode == TransitionCode.OPENED) {
>     regionOpenedWithoutPersistingToMeta(env.getAssignmentManager(), regionNode,
>       transitionCode, openSeqNum);
>   } else {
>     assert transitionCode == TransitionCode.FAILED_OPEN;
>     // will not persist to meta if giveUp is false
>     env.getAssignmentManager().regionFailedOpen(regionNode, false);
>   }
> } {code}
>
> So, we only need to add a check in restoreSucceedState:
> {code:java}
> @Override
> protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
>     long openSeqNum) throws IOException {
>   if (regionNode.getState() == State.OPEN) {
>     // should have already been persisted, ignore
>     return;
>   }
>   // if the reported transition is not OPENED, do not change the region state;
>   // otherwise the region may be marked open on an expired region server.
>   if (super.transitionCode == TransitionCode.OPENED) {
>     regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED,
>       openSeqNum);
>   }
> } {code}
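As a standalone illustration of the difference between the two restore behaviors described above (this is not HBase code; TransitionCode, RegionState, RegionNode, and the method names are simplified stand-ins for the real classes):

```java
// Minimal sketch: how restoring a recorded report after master failover
// behaves with and without checking the reported transition code.
enum TransitionCode { OPENED, FAILED_OPEN }

enum RegionState { OPENING, OPEN }

class RegionNode {
    RegionState state = RegionState.OPENING; // state loaded from meta after failover
}

public class RestoreSketch {

    // Current behavior: the restore path ignores which transition was reported,
    // so even a replayed FAILED_OPEN report ends with the region marked OPEN.
    static RegionState restoreUnconditionally(RegionNode node) {
        if (node.state == RegionState.OPEN) {
            return node.state; // already persisted, ignore
        }
        node.state = RegionState.OPEN; // later persisted to meta as OPEN
        return node.state;
    }

    // Proposed behavior: only an OPENED report may promote the region to OPEN;
    // a FAILED_OPEN report leaves the state untouched.
    static RegionState restoreWithCheck(RegionNode node, TransitionCode reported) {
        if (node.state == RegionState.OPEN) {
            return node.state;
        }
        if (reported == TransitionCode.OPENED) {
            node.state = RegionState.OPEN;
        }
        return node.state;
    }

    public static void main(String[] args) {
        // A FAILED_OPEN report replayed after master failover:
        System.out.println(restoreUnconditionally(new RegionNode()));
        System.out.println(restoreWithCheck(new RegionNode(), TransitionCode.FAILED_OPEN));
    }
}
```

With the check, a region whose open failed stays OPENING and remains in transition for the new master to reassign, instead of being persisted as OPEN on a server that the SCP has already finished processing.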
> I would like to know whether this fix is OK; if so, I will submit a patch to
> fix it.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)