[ https://issues.apache.org/jira/browse/HBASE-28834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ravi Kishore Valeti updated HBASE-28834:
----------------------------------------
Priority: Minor (was: Major)
> Procedure queues & PE pool metrics
> ----------------------------------
>
> Key: HBASE-28834
> URL: https://issues.apache.org/jira/browse/HBASE-28834
> Project: HBase
> Issue Type: Improvement
> Reporter: Ravi Kishore Valeti
> Priority: Minor
>
> While investigating a production incident, we observed that some procedures
> were created but not executed until an HMaster failover.
> - master-2 was active & rs-1 holding meta
> - 18:40, a bunch of RSs (~80) were reported crashed; SCPs were created & were
> being executed
> - 19:51, balancer decided to move Meta region to another RS -> TRSP created
> -> Meta region went offline
> - 19:52, RS carrying meta crashed -> SCP created
> - 19:52 - Both the TRSP & SCP appeared stuck (not executing); no further logs
> related to these procedures
> - 19:55 - RPC queue size slightly increased to ~700
> - 21:09 - Master failed over from master-2 to master-3
> - Procs were loaded from store & attached.
> - 21:17 - The TRSP for meta completed and meta came back online.
> I will post the logs in some time.
> Our theory is that the TRSP & SCP related to meta were submitted but never
> executed, possibly because they were sitting in the procedure queue. An HMaster
> thread dump would have been helpful, but unfortunately none was available. We
> do have high-level RPC queue metrics, but procedure queue metrics would give a
> fuller picture of what happened to the procedures.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)