[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankit Singhal updated HBASE-21344:
----------------------------------
    Attachment: HBASE-21344-branch-2.0_v3.patch

> hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21344
>                 URL: https://issues.apache.org/jira/browse/HBASE-21344
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21344-branch-2.0.patch, HBASE-21344-branch-2.0_v2.patch, HBASE-21344-branch-2.0_v3.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: Usually this should not happen, we will release the lock before if the procedure is finished, even if the holdLock is true, arrive here means we have some holes where we do not release the lock. And the releaseLock below may fail since the procedure may have already been deleted from the procedure store.
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
>     at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>     at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>     at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>     at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>     at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
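
For readers following the numbered sequence in the issue description, here is a minimal, self-contained sketch of how a published state can stay at OPENING once the rollback itself throws. It is a toy model only: `MetaRollbackSketch`, `MetaState`, `sendOpenRegion`, `rollback` and `simulate` are hypothetical names, not HBase classes or APIs, and the `resetBeforeRollback` branch is an assumed mitigation for illustration, not a description of what the attached patches do.

```java
/** Toy model of the reported sequence. All names here are hypothetical, not HBase APIs. */
class MetaRollbackSketch {

    /** Stand-in for the meta region state that clients read from ZooKeeper. */
    enum MetaState { OFFLINE, OPENING, OPEN }

    static final int MAX_ATTEMPTS = 10;

    /** Always fails, standing in for the openRegion RPC that could not be sent. */
    static boolean sendOpenRegion() {
        return false;
    }

    /** Stand-in for a rollback that does not handle the state it is given. */
    static void rollback(String state) {
        throw new UnsupportedOperationException("unhandled state=" + state);
    }

    /** Replays steps 3 to 7 and returns the state left published afterwards. */
    static MetaState simulate(boolean resetBeforeRollback) {
        MetaState published = MetaState.OFFLINE;    // step 3: meta marked OFFLINE
        published = MetaState.OPENING;              // step 4: meta marked OPENING

        boolean opened = false;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS && !opened; attempt++) {
            opened = sendOpenRegion();              // steps 5-6: every attempt fails
        }
        if (opened) {
            return MetaState.OPEN;
        }

        if (resetBeforeRollback) {
            // Assumed mitigation for illustration only: undo the published state
            // before anything that might throw.
            published = MetaState.OFFLINE;
        }
        try {
            rollback("SERVER_CRASH_GET_REGIONS");   // step 7: rollback throws
        } catch (UnsupportedOperationException e) {
            // The rollback stops here, so no later cleanup runs.
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(simulate(false));  // prints OPENING
        System.out.println(simulate(true));   // prints OFFLINE
    }
}
```

In this model the first call leaves the state at OPENING, which corresponds to the "Meta region is in state OPENING" client error in the last log excerpt; the second call shows that the stale state disappears only if something resets it independently of the failing rollback.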