[ 
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656264#comment-16656264
 ] 

Josh Elser commented on HBASE-21344:
------------------------------------

I think Ankit also found (via unit tests in the patch) that you can {{move}} 
hbase:meta in the shell to fix this without mucking in ZK directly.

Trying to distill it down to the simplest explanation:

The RegionState for meta in ZK gets left as OPENING after a failed AP for meta 
(for us, due to a krb issue, but could be any cause) and then the master is 
restarted. Rollback of this AP doesn't undo the OPENING RegionState for meta 
back to OFFLINE (or is it CLOSED? whatever). Issue #1.

Issue #2 is that the IMP submitted by the new active Master doesn't get 
scheduled because we only check that meta is not OFFLINE, not that its state is 
anything but OPEN (similar problem we had for assignment issues in earlier 
2.0.x). Thus, we just sit and wait for meta to come online, but nothing is 
actually working on assinging meta.

Ensuring we submit that IMP fixes the issue, but setting the state back to 
OFFLINE during the rollback of AP would have also helped us avoid this. Ankit's 
patch should fix both of these.

One big open question Ankit and I both have is around resiliency of that IMP. 
What happens if, on master startup and IMP submission or replay, the IMP fails. 
Should we resubmit the IMP? Or, do we expect AM to come along and do this for 
us? Reading the code, it seems to imply that only SCP or IMP will assign meta 
which we think means we need to be more careful watching for IMP's successful 
completion.

Hope this helps, gents. I've reviewed Ankit patch before he posted here, and 
would say I'm +0.99 on it already (want to hear from you smart fellows, 
obviously). I'm offline on Friday, but will be back Monday if there's any 
questions for me. Ankit thoroughly has a handle on the problem, so you probably 
don't need me further :)

> hbase:meta location in ZooKeeper set to OPENING by the procedure which 
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21344
>                 URL: https://issues.apache.org/jira/browse/HBASE-21344
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21344-branch-2.0.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN  [PEWorker-9] procedure2.ProcedureExecutor: 
> Usually this should not happen, we will release the lock before if the 
> procedure is finished, even if the holdLock is true, arrive here means we 
> have some holes where we do not release the lock. And the releaseLock below 
> may fail since the procedure may have already been deleted from the procedure 
> store.
> 2018-10-08 06:51:24,543 INFO  [PEWorker-9] 
> procedure.MasterProcedureScheduler: pid=48, ppid=47, 
> state=FAILED:REGION_TRANSITION_QUEUE, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 
> checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=47, 
> state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; ServerCrashProcedure 
> server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled 
> state=SERVER_CRASH_GET_REGIONS
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>       at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, 
> retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state 
> OPENING, details=row 'backup:system' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, 
> exception=java.io.IOException: Meta region is in state OPENING
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>         at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to