[
https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656248#comment-16656248
]
Ankit Singhal edited comment on HBASE-21344 at 10/19/18 4:01 AM:
-----------------------------------------------------------------
[~stack]
bq. Is this 2.0.0 or branch-2.0?
I checked with branch-2.0; we may need to check other branches as well.
bq. If Master is restarted, what happens?
Master will not get initialized and gets stuck, because during startup the
master currently checks whether the meta region is offline in order to decide
whether to schedule an InitMetaProcedure.
{code}
java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1213908: Meta region is in state OPENING row 'test_table' on table 'hbase:meta' at null
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:159)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
    at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1066)
    at org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:258)
    at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:213)
    at org.apache.hadoop.hbase.master.TableStateManager.migrateZooKeeper(TableStateManager.java:338)
    at org.apache.hadoop.hbase.master.TableStateManager.start(TableStateManager.java:267)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:914)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2090)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:553)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Meta region is in state OPENING
    at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
    at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
    at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
    at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
    ... 1 more
{code}
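To make the chicken-and-egg more concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual HMaster/InitMetaProcedure code) of the startup decision described above: a fresh meta assignment is only scheduled when the recorded state is OFFLINE, so an OPENING left behind by the rolled-back procedure is never re-assigned.
{code:java}
// Hypothetical sketch, NOT the real HBase code paths; it only illustrates why
// a stale OPENING blocks master initialization as in the trace above.
public class InitMetaCheckSketch {
  enum MetaState { OFFLINE, OPENING, OPEN }

  static boolean shouldScheduleInitMeta(MetaState lastRecordedState) {
    // Only OFFLINE triggers a new assignment; OPENING is treated as
    // "already in progress" even though the owning procedure rolled back.
    return lastRecordedState == MetaState.OFFLINE;
  }

  public static void main(String[] args) {
    System.out.println(shouldScheduleInitMeta(MetaState.OFFLINE)); // true  -> meta gets assigned
    System.out.println(shouldScheduleInitMeta(MetaState.OPENING)); // false -> master stuck
  }
}
{code}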
[~allan163]
bq. I have seen similar issues, a Assign procedure of SCP failed and rolled
back, and the whole SCP rolled back, left some regions unassigned.
Are the user regions not eventually assigned by the balancer or after a master
restart? We saw this specifically with the meta region, which is never assigned
even after a master restart until we delete the meta znode.
> hbase:meta location in ZooKeeper set to OPENING by the procedure which
> eventually failed but precludes Master from assigning it forever
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21344
> URL: https://issues.apache.org/jira/browse/HBASE-21344
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Priority: Major
> Attachments: HBASE-21344-branch-2.0.patch
>
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: Usually this should not happen, we will release the lock before if the procedure is finished, even if the holdLock is true, arrive here means we have some holes where we do not release the lock. And the releaseLock below may fail since the procedure may have already been deleted from the procedure store.
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=<ip-address>,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
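> The CODE-BUG above is the key part: the rollback of the SCP reaches a step it does not know how to undo, so the partially-applied state (meta marked OPENING in ZooKeeper) is left behind. A minimal, self-contained sketch of that failure mode (hypothetical names, not the actual ServerCrashProcedure code) follows:
> {code:java}
> // Hypothetical sketch of a state-machine rollback that only knows how to undo
> // some of its steps; hitting an unhandled step aborts the rollback, so the
> // externally recorded state (the meta-location znode) is never reset.
> public class RollbackSketch {
>   enum Step { GET_REGIONS, ASSIGN_META }
>
>   static void rollback(Step step) {
>     switch (step) {
>       case ASSIGN_META:
>         // Imagine undoing the meta assignment here.
>         break;
>       default:
>         // Mirrors "unhandled state=SERVER_CRASH_GET_REGIONS" in the log above:
>         // the rollback itself fails, leaving meta stuck in OPENING.
>         throw new UnsupportedOperationException("unhandled state=" + step);
>     }
>   }
>
>   public static void main(String[] args) {
>     rollback(Step.GET_REGIONS); // throws, like the CODE-BUG above
>   }
> }
> {code}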
> {code:java}
> DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<hostname>, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
>     at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>     at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
>     at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165)
>     at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
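> The client-side retries in the last trace come from that same stale state: a minimal sketch (hypothetical names, not the real ZKAsyncRegistry) of why every lookup fails while the znode says OPENING:
> {code:java}
> import java.io.IOException;
>
> // Hypothetical simplification of the meta-location lookup: a location is only
> // usable when the recorded state is OPEN, so a permanently OPENING znode turns
> // every client call into IOException + retry, as in the DEBUG log above.
> public class MetaLookupSketch {
>   enum MetaState { OFFLINE, OPENING, OPEN }
>
>   static String resolveMetaLocation(MetaState recordedState, String server) throws IOException {
>     if (recordedState != MetaState.OPEN) {
>       throw new IOException("Meta region is in state " + recordedState);
>     }
>     return server;
>   }
>
>   public static void main(String[] args) throws IOException {
>     System.out.println(resolveMetaLocation(MetaState.OPEN, "rs3:16020"));    // returns the location
>     System.out.println(resolveMetaLocation(MetaState.OPENING, "rs3:16020")); // throws -> caller retries until timeout
>   }
> }
> {code}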
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)