[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever

Ankit Singhal (JIRA) Tue, 23 Oct 2018 15:30:27 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661395#comment-16661395
 ]


Ankit Singhal commented on HBASE-21344:
---------------------------------------

bq. This should be happening already. We wait on meta assign. If SCPs, they'll 
run and recover meta if one of them was holding it. If no assign for meta in 
the procedure store, then something untoward and at least for now, operator 
needs to figure what happened until we fix the bug. Operator can schedule an 
assign with hbck2....
bq. branch-2.0 will go into a holding pattern if hbase:meta is not assigned 
(ditto if hbase:namespace is not assigned) waiting on operator intevention to 
clear the lack-of-assign.
Thanks [~stack] for the pointer, I didn't go down as the problem was started 
when we are starting tableStateManager without waiting for meta assignment by 
SCPs. I think we can just remove this from here as we already starting after 
waiting for meta to get online.(attached patch for the same)
{code}
     if (initMetaProc != null) {
       initMetaProc.await();
     }
-    tableStateManager.start();
{code}

bq. That said, I see some value in this patch. In particular the bit around 
resetting hbase:meta state if failure.
We shouldn't offline the meta if we are failing the assignment as it will start 
the InitMetaProcedure (which we don't want as SCP need to take care of 
recovering of Meta).



> hbase:meta location in ZooKeeper set to OPENING by the procedure which 
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21344
>                 URL: https://issues.apache.org/jira/browse/HBASE-21344
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21344-branch-2.0.patch, 
> HBASE-21344-branch-2.0_v2.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN  [PEWorker-9] procedure2.ProcedureExecutor: 
> Usually this should not happen, we will release the lock before if the 
> procedure is finished, even if the holdLock is true, arrive here means we 
> have some holes where we do not release the lock. And the releaseLock below 
> may fail since the procedure may have already been deleted from the procedure 
> store.
> 2018-10-08 06:51:24,543 INFO  [PEWorker-9] 
> procedure.MasterProcedureScheduler: pid=48, ppid=47, 
> state=FAILED:REGION_TRANSITION_QUEUE, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 
> checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=47, 
> state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; ServerCrashProcedure 
> server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled 
> state=SERVER_CRASH_GET_REGIONS
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>       at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, 
> retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state 
> OPENING, details=row 'backup:system' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, 
> exception=java.io.IOException: Meta region is in state OPENING
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>         at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever

Reply via email to