[ 
https://issues.apache.org/jira/browse/HBASE-28834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Kishore Valeti updated HBASE-28834:
----------------------------------------
    Priority: Minor  (was: Major)

> Procedure queues & PE pool metrics
> ----------------------------------
>
>                 Key: HBASE-28834
>                 URL: https://issues.apache.org/jira/browse/HBASE-28834
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ravi Kishore Valeti
>            Priority: Minor
>
> While investigating a production incident, we observed that some procedures 
> were created but never executed until an HMaster failover.
>  - master-2 was active & rs-1 was holding meta
>  - 18:40 - A bunch of RSs (~80) were reported as crashed; SCPs were created & 
> were being executed
>  - 19:51 - Balancer decided to move the meta region to another RS -> TRSP 
> created -> meta region went offline
>  - 19:52 - RS carrying meta crashed -> SCP created
>  - 19:52 - Both TRSP & SCP seemed stuck/not executing; no more logs related 
> to these procedures
>  - 19:55 - RPC queue size slightly increased to ~700
>  - 21:09 - Master failed over from master-2 to master-3
>  - Procs were loaded from store & attached.
>  - 21:17 - The TRSP for meta completed and meta came back online.
> I will post the logs shortly.
> Our theory is that the TRSP & SCP related to meta were submitted but never 
> executed, possibly because they were sitting in the procedure queue. An 
> HMaster thread dump would have been helpful, but unfortunately none was 
> available. We do have RPC queue metrics at a high level, but procedure queue 
> metrics would give a more complete picture of what could have happened to the 
> procedures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
