[
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661395#comment-16661395
]
Ankit Singhal commented on HBASE-21344:
---------------------------------------
bq. This should be happening already. We wait on meta assign. If SCPs, they'll
run and recover meta if one of them was holding it. If no assign for meta in
the procedure store, then something untoward and at least for now, operator
needs to figure what happened until we fix the bug. Operator can schedule an
assign with hbck2....
bq. branch-2.0 will go into a holding pattern if hbase:meta is not assigned
(ditto if hbase:namespace is not assigned) waiting on operator intervention to
clear the lack-of-assign.
Thanks [~stack] for the pointer. I hadn't gone down that path, as the problem
started when we were starting tableStateManager without waiting for meta
assignment by SCPs. I think we can just remove this call from here, as we
already start it after waiting for meta to come online (patch attached).
{code}
  if (initMetaProc != null) {
    initMetaProc.await();
  }
- tableStateManager.start();
{code}
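To illustrate the ordering argument, here is a minimal, self-contained sketch. It is not HBase code: StartupOrderSketch, metaOnline, assignMeta and startTableStateManager are hypothetical names, and the latch only stands in for "meta is online". It shows a component that reads meta being started only after the meta-online wait completes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StartupOrderSketch {
    // Hypothetical stand-in for "hbase:meta is online"; not an HBase API.
    private final CountDownLatch metaOnline = new CountDownLatch(1);
    private final List<String> events = new ArrayList<>();

    // Stand-in for whichever procedure (for example an SCP) finishes the meta assign.
    void assignMeta() {
        record("meta assigned");
        metaOnline.countDown();
    }

    // Stand-in for a component that needs to read meta when it starts.
    void startTableStateManager() throws InterruptedException {
        // Block until meta is online instead of starting eagerly.
        if (!metaOnline.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("meta did not come online in time");
        }
        record("tableStateManager started");
    }

    private synchronized void record(String event) {
        events.add(event);
    }

    synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) throws Exception {
        StartupOrderSketch sketch = new StartupOrderSketch();
        Thread assigner = new Thread(sketch::assignMeta);
        assigner.start();
        sketch.startTableStateManager();
        assigner.join();
        System.out.println(sketch.events());
    }
}
```

Because the start call waits on the latch, the "started" event can only be recorded after the "assigned" event, regardless of thread scheduling.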
bq. That said, I see some value in this patch. In particular the bit around
resetting hbase:meta state if failure.
We shouldn't mark meta OFFLINE when the assignment fails, as that will start
the InitMetaProcedure (which we don't want, since the SCP needs to take care
of recovering meta).
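A rough sketch of that concern, under a stated assumption: that startup schedules an init-meta step only when meta is recorded as OFFLINE. The class, enum and method names below are hypothetical and are not taken from the patch or from HBase.

```java
public class MetaStateSketch {
    // Hypothetical, simplified view of the recorded meta region state.
    enum MetaState { OFFLINE, OPENING, OPEN }

    // Assumption: an init-meta step is scheduled only when meta is recorded as OFFLINE.
    static boolean wouldScheduleInitMeta(MetaState recorded) {
        return recorded == MetaState.OFFLINE;
    }

    // Policy A (hypothetical): reset the recorded state to OFFLINE when an assign fails.
    static MetaState onAssignFailureResetting(MetaState current) {
        return MetaState.OFFLINE;
    }

    // Policy B (hypothetical): leave the recorded state alone and let crash recovery handle it.
    static MetaState onAssignFailureKeeping(MetaState current) {
        return current;
    }

    public static void main(String[] args) {
        MetaState afterReset = onAssignFailureResetting(MetaState.OPENING);
        MetaState afterKeep = onAssignFailureKeeping(MetaState.OPENING);
        System.out.println("reset -> init scheduled: " + wouldScheduleInitMeta(afterReset));
        System.out.println("keep  -> init scheduled: " + wouldScheduleInitMeta(afterKeep));
    }
}
```

Under that assumption, only the resetting policy would trigger the init path on the next startup check, which is the behaviour the comment argues against.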
> hbase:meta location in ZooKeeper set to OPENING by the procedure which
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21344
> URL: https://issues.apache.org/jira/browse/HBASE-21344
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Priority: Major
> Attachments: HBASE-21344-branch-2.0.patch,
> HBASE-21344-branch-2.0_v2.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: Usually this should not happen, we will release the lock before if the procedure is finished, even if the holdLock is true, arrive here means we have some holes where we do not release the lock. And the releaseLock below may fail since the procedure may have already been deleted from the procedure store.
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
>   at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>   at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>   at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>   at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
>   at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>   at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>   at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>   at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>   at java.lang.Thread.run(Thread.java:748)
> {code}