[ 
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661554#comment-16661554
 ] 

Ankit Singhal edited comment on HBASE-21344 at 10/24/18 1:30 AM:
-----------------------------------------------------------------

{quote}What you doing here from patch?{quote}
I'm just removing duplicate tableStateManager.start() (and keeping the 
tableStateManager.start() after checking meta is actually online).
And the test in the patch is to check if we have OPENING state for meta also, 
still SCP can succeed , so we don't need to change state of meta znode to 
offline during FAILED_OPEN of assign procedure(this meta znode state will 
intern also help in avoiding IMP if meta is in transition due to Server crash). 
Anyways, a sub-task to increase the no. of Assign procedure , will not let the 
call go in FAILED_OPEN path(I think).

{quote}
1096 Optional<ServerCrashProcedure> optProc = 
this.procedureExecutor.getProcedures().stream()
 1097 .filter(p -> p instanceof ServerCrashProcedure).map(o -> 
(ServerCrashProcedure) o)
 1098 .filter(s -> s.hasMetaTableRegion()).findAny();
 You are reporting SCPs only if they have meta on them? Isn't this method more 
generic than just meta searches?
{quote}
Yes, My bad, we don't need this particular change.


was (Author: an...@apache.org):
{quote}What you doing here from patch?{quote}
I'm just removing duplicate tableStateManager.start() (and keeping the 
tableStateManager.start() after checking meta is actually online).
And the test in the patch is to check if we have OPENING state for meta also, 
still SCP can succeed , so we don't need to change state of meta znode to 
offline during FAILED_OPEN of assign procedure(this meta znode state will 
intern also help in avoiding IMP if meta is in transition due to Server crash)

{quote}
1096 Optional<ServerCrashProcedure> optProc = 
this.procedureExecutor.getProcedures().stream()
 1097 .filter(p -> p instanceof ServerCrashProcedure).map(o -> 
(ServerCrashProcedure) o)
 1098 .filter(s -> s.hasMetaTableRegion()).findAny();
 You are reporting SCPs only if they have meta on them? Isn't this method more 
generic than just meta searches?
{quote}
Yes, My bad, we don't need this particular change.

> hbase:meta location in ZooKeeper set to OPENING by the procedure which 
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21344
>                 URL: https://issues.apache.org/jira/browse/HBASE-21344
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21344-branch-2.0.patch, 
> HBASE-21344-branch-2.0_v2.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN  [PEWorker-9] procedure2.ProcedureExecutor: 
> Usually this should not happen, we will release the lock before if the 
> procedure is finished, even if the holdLock is true, arrive here means we 
> have some holes where we do not release the lock. And the releaseLock below 
> may fail since the procedure may have already been deleted from the procedure 
> store.
> 2018-10-08 06:51:24,543 INFO  [PEWorker-9] 
> procedure.MasterProcedureScheduler: pid=48, ppid=47, 
> state=FAILED:REGION_TRANSITION_QUEUE, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 
> checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=47, 
> state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, 
> exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via 
> AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max 
> attempts exceeded; ServerCrashProcedure 
> server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled 
> state=SERVER_CRASH_GET_REGIONS
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>       at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, 
> retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state 
> OPENING, details=row 'backup:system' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, 
> exception=java.io.IOException: Meta region is in state OPENING
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>         at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>         at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to