[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658335#comment-16658335 ]
Ankit Singhal commented on HBASE-21344:
---------------------------------------

Thanks [~stack] for taking a look. PFB, my responses.

{quote}So, you are trying to figure the case where the assign in IMP failed to succeed – where the region is stuck in the OPENING state – and if you can find this condition, you'd reschedule an IMP (the body of which happens to be an assign of meta)?{quote}

Here we are actually correcting the rollback of the assign procedure for meta, in both IMP and SCP. We do not re-schedule the IMP until the master restarts (or a standby becomes active) and finds that meta is still not OPEN. Earlier, the rollback of the assign corrected the meta region state node (by moving it back to the offline state), but for meta we also store state in a separate znode (/hbase/meta-region-server), which was never cleared or set back to offline. (The patch fixes this.)

{quote}What you think of the discussion over in HBASE-21035 where we decide to punt on auto-assign for now at least (IMP only assigns, doesn't do recovery of meta WALs if any).{quote}

HBASE-21035 looks quite similar, but its handling seems more related to the case where the procedure WALs are accidentally/intentionally cleared. In our case, splitting of the meta logs completed but the assign in SCP failed, which left the meta znode in the OPENING state even after rollback; meta is then never assigned (even after a restart), so no further SCP ever kicks in and the cluster stays stuck. So to fix this we are doing two things (rough sketches of both follow below):
* fixing the meta region server znode (back to the offline state) during the rollback (undoRegionOpening)
* during master initialization, checking the meta assignment and, if it is still not open, scheduling another IMP for the assignment and waiting on it for completion

(Do you think scheduling the IMP without checking whether the meta logs were split will cause any problem?)
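For clarity, here is a rough sketch of the first point. This is only an illustration of the intended control flow, not the branch-2.0 patch itself; MetaZnode and MetaState are hypothetical stand-ins for what HBase actually does through MetaTableLocator against /hbase/meta-region-server:

{code:java}
// Hedged sketch only -- NOT the actual HBASE-21344 patch.
// MetaZnode/MetaState are hypothetical stand-ins for the bookkeeping HBase
// keeps in the /hbase/meta-region-server znode.
import java.util.concurrent.atomic.AtomicReference;

public class UndoMetaOpeningSketch {

  /** Stand-in for the state serialized into /hbase/meta-region-server. */
  enum MetaState { OFFLINE, OPENING, OPEN }

  /** Hypothetical wrapper around the meta location znode. */
  static final class MetaZnode {
    private final AtomicReference<MetaState> state =
        new AtomicReference<>(MetaState.OPENING);
    MetaState get() { return state.get(); }
    void set(MetaState s) { state.set(s); }
  }

  /**
   * Rollback of a failed assign (undoRegionOpening): the existing code already
   * moves the in-memory region state node back to OFFLINE; the missing piece
   * was doing the same for the meta znode when the region is hbase:meta.
   */
  static void undoRegionOpening(boolean isMetaRegion, MetaZnode metaZnode) {
    // ... existing rollback of the in-memory RegionStateNode happens here ...
    if (isMetaRegion && metaZnode.get() == MetaState.OPENING) {
      // Without this, the znode keeps saying OPENING forever and no later
      // master or SCP will ever assign meta again.
      metaZnode.set(MetaState.OFFLINE);
    }
  }

  public static void main(String[] args) {
    MetaZnode znode = new MetaZnode();   // stuck in OPENING, as in this bug
    undoRegionOpening(true, znode);      // rollback of the failed meta assign
    System.out.println("meta znode state after rollback: " + znode.get()); // OFFLINE
  }
}
{code}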
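And a similarly rough sketch of the second point, the check during master initialization. Again this is only illustrative; MetaStateReader and ProcedureRunner are hypothetical stand-ins for the meta znode lookup and for the ProcedureExecutor/InitMetaProcedure wiring:

{code:java}
// Hedged sketch only -- not HBase's actual master-initialization code.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class MasterInitMetaCheckSketch {

  interface MetaStateReader {
    /** e.g. "OPEN", "OPENING", "OFFLINE", as stored in /hbase/meta-region-server. */
    String currentMetaState();
  }

  interface ProcedureRunner {
    /** Schedules a fresh meta assign (the IMP); completes when it finishes. */
    CompletableFuture<Void> scheduleInitMeta();
  }

  /**
   * During master initialization (or standby takeover): if meta never reached
   * OPEN, e.g. because an earlier SCP/IMP assign rolled back, schedule another
   * IMP and block until it completes. Meta WAL splitting was already done by
   * the failed SCP, so only the assign is redone here.
   */
  static void ensureMetaAssigned(MetaStateReader metaZnode, ProcedureRunner procs)
      throws ExecutionException, InterruptedException {
    if (!"OPEN".equals(metaZnode.currentMetaState())) {
      procs.scheduleInitMeta().get();  // wait for the IMP before continuing init
    }
  }

  public static void main(String[] args) throws Exception {
    // Toy wiring: meta is stuck in OPENING, so the "IMP" gets scheduled.
    ensureMetaAssigned(
        () -> "OPENING",
        () -> {
          System.out.println("scheduling InitMetaProcedure (IMP)...");
          return CompletableFuture.completedFuture(null);
        });
  }
}
{code}

Blocking on the IMP here is deliberate: the rest of master initialization should not proceed while meta is unassigned.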
> hbase:meta location in ZooKeeper set to OPENING by the procedure which
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21344
>                 URL: https://issues.apache.org/jira/browse/HBASE-21344
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21344-branch-2.0.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN  [PEWorker-9] procedure2.ProcedureExecutor: Usually this should not happen, we will release the lock before if the procedure is finished, even if the holdLock is true, arrive here means we have some holes where we do not release the lock. And the releaseLock below may fail since the procedure may have already been deleted from the procedure store.
> 2018-10-08 06:51:24,543 INFO  [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
>   at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>   at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>   at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>   at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
>   at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>   at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>   at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>   at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>   at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)