[ 
https://issues.apache.org/jira/browse/HBASE-28834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Kishore Valeti updated HBASE-28834:
----------------------------------------
    Description: 
While investigating a production incident, we observed that some procedures 
were created but never executed until an HMaster failover.
 - master-2 was active & rs-1 was holding meta
 - 18:40 - A bunch of RSs (~80) were reported crashed; SCPs were created & 
were being executed
 - 19:51 - Balancer decided to move the meta region to another RS -> TRSP 
created -> meta region went offline
 - 19:52 - RS carrying meta crashed -> SCP created
 - 19:52 - Both TRSP & SCP seemed stuck/not executing; no more logs related 
to these procedures
 - 19:55 - RPC queue size increased slightly to ~700
 - 21:09 - Master failed over from master-2 to master-3
 - Procs were loaded from store & attached.
 - 21:17 - Once the TRSP for meta completed, meta came back online.

I will post the logs in some time.

We have a theory that the TRSP & SCP related to meta were submitted but never 
got executed, possibly because they were sitting in the procedure queue. 
HMaster thread dumps would have been helpful but, unfortunately, none were 
available. We do have RPC queue metrics at a high level, but Procedure queue 
metrics would give a more complete picture of what happened to the procedures.

  was:
While investigating a production incident, we observed that some procedures are 
getting created but never getting executed until a HMaster failover.

- master-2 was active & rs-1 holding meta
- 18:40, bunch of RSs (~80) reported crashed & SCPs were created & being 
executed
- 19:51, balancer decided to move Meta region to another RS -> TRSP created -> 
Meta region went offline
- 19:52, RS carrying meta crashed -> SCP created
- 19:52 - Both TRSP & SCP seemed stuck/not executing  - No more logs related to 
these procedures
- 21:09 - Master failed over from master-2 to master-3
- Procs were loaded from store & attached.
- 21:17 -  When the TRSP for meta had completed, meta came back online.


I will post the logs in some time.


> Procedure queues & PE pool metrics
> ----------------------------------
>
>                 Key: HBASE-28834
>                 URL: https://issues.apache.org/jira/browse/HBASE-28834
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ravi Kishore Valeti
>            Priority: Major
>
> While investigating a production incident, we observed that some procedures 
> were created but never executed until an HMaster failover.
>  - master-2 was active & rs-1 holding meta
>  - 18:40, bunch of RSs (~80) reported crashed & SCPs were created & being 
> executed
>  - 19:51, balancer decided to move Meta region to another RS -> TRSP created 
> -> Meta region went offline
>  - 19:52, RS carrying meta crashed -> SCP created
>  - 19:52 - Both TRSP & SCP seemed stuck/not executing  - No more logs related 
> to these procedures
>  - 19:55 - RPC queue size slightly increased to ~700
>  - 21:09 - Master failed over from master-2 to master-3
>  - Procs were loaded from store & attached.
>  - 21:17 -  When the TRSP for meta had completed, meta came back online.
> I will post the logs in some time.
> We have a theory that the TRSP & SCP related to meta were submitted but never 
> got executed, possibly because they were sitting in the procedure queue. 
> HMaster thread dumps would have been helpful but, unfortunately, none were 
> available. We do have RPC queue metrics at a high level, but Procedure queue 
> metrics would give a more complete picture of what happened to the 
> procedures.
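
As a rough illustration of the kind of gauges requested above, here is a 
minimal, self-contained Java sketch. It is not HBase code and does not use the 
ProcedureExecutor or scheduler APIs; the class name ProcQueueMetricsSketch and 
all of its methods are hypothetical. It only shows three numbers that could 
help tell "stuck in the queue" apart from "stuck while executing": queue depth, 
busy/idle worker counts, and how long the oldest queued procedure has waited.

```java
import java.util.ArrayDeque;

// Hypothetical illustration only; NOT HBase's ProcedureExecutor or scheduler.
final class ProcQueueMetricsSketch {
    /** A queued procedure id stamped with its submit time. */
    private static final class Queued {
        final long procId;
        final long submittedAtMs;
        Queued(long procId, long submittedAtMs) {
            this.procId = procId;
            this.submittedAtMs = submittedAtMs;
        }
    }

    private final ArrayDeque<Queued> queue = new ArrayDeque<>();
    private final int poolSize;   // assumed fixed number of worker threads
    private int activeWorkers;    // workers currently running a procedure

    ProcQueueMetricsSketch(int poolSize) {
        this.poolSize = poolSize;
    }

    /** Enqueue a procedure, remembering when it was submitted. */
    synchronized void submit(long procId, long nowMs) {
        queue.addLast(new Queued(procId, nowMs));
    }

    /** A worker takes the next procedure; -1 if the queue is empty or all workers are busy. */
    synchronized long poll() {
        if (activeWorkers >= poolSize || queue.isEmpty()) {
            return -1L;
        }
        activeWorkers++;
        return queue.pollFirst().procId;
    }

    /** A worker reports that it finished its procedure. */
    synchronized void done() {
        if (activeWorkers > 0) {
            activeWorkers--;
        }
    }

    // Gauges a metrics system could sample periodically.
    synchronized int queueSize() { return queue.size(); }
    synchronized int activeWorkers() { return activeWorkers; }
    synchronized int idleWorkers() { return poolSize - activeWorkers; }

    /** Wait time of the oldest queued procedure, or 0 if nothing is queued. */
    synchronized long oldestWaitMs(long nowMs) {
        Queued head = queue.peekFirst();
        return head == null ? 0L : nowMs - head.submittedAtMs;
    }
}
```

In a scenario like the one in this report, a steadily growing oldestWaitMs 
together with idleWorkers at 0 would suggest that all workers were occupied 
and the meta TRSP/SCP were waiting in the queue. That reading is an assumption 
about how such gauges would be interpreted, not a finding from this incident.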



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
