[
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657691#comment-16657691
]
Ankit Singhal edited comment on HBASE-21344 at 10/20/18 3:00 AM:
-----------------------------------------------------------------
bq. You need something for 2.0.0? 2.0.0 is tough because hbck2 only starts
working in 2.0.3 (not yet released) or tip of branch-2.0.
bq. If you can go to the tip of branch-2.0, you can use hbck2 to schedule an
assign of hbase:meta.
[~stack], do you think what we did in the attached patch (as described by
[~elserj] in the last comment) can help in this particular use case? I can
also look at the hbck2 code to see whether it takes care of meta when it is
left unassigned after a failure in IMP/SCP.
> hbase:meta location in ZooKeeper set to OPENING by the procedure which
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21344
> URL: https://issues.apache.org/jira/browse/HBASE-21344
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Priority: Major
> Attachments: HBASE-21344-branch-2.0.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor:
> Usually this should not happen, we will release the lock before if the
> procedure is finished, even if the holdLock is true, arrive here means we
> have some holes where we do not release the lock. And the releaseLock below
> may fail since the procedure may have already been deleted from the procedure
> store.
> 2018-10-08 06:51:24,543 INFO [PEWorker-9]
> procedure.MasterProcedureScheduler: pid=48, ppid=47,
> state=FAILED:REGION_TRANSITION_QUEUE,
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max
> attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740
> checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=47,
> state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true,
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max
> attempts exceeded; ServerCrashProcedure
> server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled
> state=SERVER_CRASH_GET_REGIONS
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7,
> retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state
> OPENING, details=row 'backup:system' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1,
> exception=java.io.IOException: Meta region is in state OPENING
> at
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> at
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> at
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
> at
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
> at java.lang.Thread.run(Thread.java:748)
> {code}
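To make steps 1-7 of the quoted description easier to follow, here is a toy model of the failure mode. It is only a sketch under stated assumptions and not HBase code: MetaRollbackSketch, MetaState, Step, sendOpenRegion, rollback and simulate are hypothetical names, the published meta state is a plain local variable instead of a znode, and the rollback order is simplified. It shows only that when rollback throws on a step it does not handle, nothing resets the published state, so it stays OPENING.

{code:java}
// Toy model only; every name here is hypothetical and nothing below is HBase API.
final class MetaRollbackSketch {

    // Stand-in for the hbase:meta state published in ZooKeeper.
    enum MetaState { OPEN, OFFLINE, OPENING }

    // Reduced, hypothetical set of crash-procedure steps.
    enum Step { GET_REGIONS, ASSIGN_META }

    // The description reports 10 failed assignment attempts.
    static final int MAX_ATTEMPTS = 10;

    // Stand-in for the openRegion RPC; it always fails here,
    // mimicking the krb ticket issue from step 5.
    static boolean sendOpenRegion() {
        return false;
    }

    // Rollback that knows how to undo only one step and throws on the rest.
    static void rollback(Step step) {
        switch (step) {
            case ASSIGN_META:
                // Assumption of this sketch: nothing is restored for this step.
                break;
            default:
                throw new UnsupportedOperationException("unhandled state=" + step);
        }
    }

    // Replays steps 3-7 and returns the state that remains published.
    static String simulate() {
        MetaState published = MetaState.OPEN;
        published = MetaState.OFFLINE;   // step 3: meta marked OFFLINE
        published = MetaState.OPENING;   // step 4: meta marked OPENING on the new server

        boolean opened = false;
        for (int attempt = 0; attempt < MAX_ATTEMPTS && !opened; attempt++) {
            opened = sendOpenRegion();   // steps 5-6: every attempt fails
        }

        if (opened) {
            published = MetaState.OPEN;
        } else {
            try {
                // Step 7: roll back the executed steps in reverse order.
                rollback(Step.ASSIGN_META);
                rollback(Step.GET_REGIONS); // throws: unhandled state
            } catch (UnsupportedOperationException e) {
                // The procedure is abandoned; the published state is never reset.
            }
        }
        return published.name();
    }

    public static void main(String[] args) {
        System.out.println("published meta state: " + simulate());
    }
}
{code}

In this model any reader that waits for the published state to become OPEN would wait indefinitely, which is consistent with the "Meta region is in state OPENING" client exception quoted above.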