[jira] [Updated] (PHOENIX-7872) Add client-side metrics for HA failover, mutation-block, and CRR cache health

Lokesh Khurana (Jira) Thu, 28 May 2026 14:35:09 -0700


     [ 
https://issues.apache.org/jira/browse/PHOENIX-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lokesh Khurana updated PHOENIX-7872:
------------------------------------
    Description: 
The HA client today emits metrics only for the PARALLEL policy
(HA_PARALLEL_COUNT_*) and the HA executor pool task counters
(taskRejectedCounter, taskExecutedCounter, taskEndToEndCounter). There is no
client-side observability for the FAILOVER policy's transitions, the
mutation-block path, or the health of the CRR (ClusterRoleRecord) cache. As a
result, operators cannot detect failover events, measure failover duration,
alert on the mutation-block hit rate, or diagnose CRR cache staleness without
scraping logs.

  Proposed metrics, split into two tiers.

  Tier 1 (client-side metrics, mirroring the existing
PhoenixHAGroupMetrics.HAMetricType enum pattern):

  - HA_FAILOVER_COUNT — Counter, emitted at FailoverPhoenixConnection.failover()
  - HA_FAILOVER_DURATION_MS — Histogram, emitted at the same site, around the 
failover try/finally
  - HA_MUTATION_BLOCKED_COUNT — Counter, emitted at MutationState.send catch 
site for MutationBlockedIOException causes
  - HA_STALE_CRR_DETECTED_COUNT — Counter, emitted at 
FailoverPhoenixConnection.wrapActionDuringFailover SCRE catch site
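
As a rough illustration of the Tier 1 emission points listed above, the sketch below counts each failover attempt, records its duration in a try/finally so that failed attempts are timed too, and walks an exception's cause chain to count mutation blocks. Every name in it (SketchMetric, Tier1MetricsSketch, SketchMutationBlockedException) is hypothetical and is not the PhoenixHAGroupMetrics API; the plain list of durations stands in for a real histogram.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the MutationBlockedIOException named in the ticket. */
class SketchMutationBlockedException extends IOException {
    SketchMutationBlockedException(String msg) { super(msg); }
}

/** Hypothetical metric names that mirror the Tier 1 counters. */
enum SketchMetric {
    HA_FAILOVER_COUNT,
    HA_MUTATION_BLOCKED_COUNT,
    HA_STALE_CRR_DETECTED_COUNT
}

class Tier1MetricsSketch {
    private final Map<SketchMetric, LongAdder> counters = new EnumMap<>(SketchMetric.class);
    // Raw duration samples; a real histogram would bucket these instead.
    private final List<Long> failoverDurationsMs = new ArrayList<>();

    Tier1MetricsSketch() {
        for (SketchMetric m : SketchMetric.values()) {
            counters.put(m, new LongAdder());
        }
    }

    void increment(SketchMetric m) { counters.get(m).increment(); }

    long count(SketchMetric m) { return counters.get(m).sum(); }

    synchronized int durationSamples() { return failoverDurationsMs.size(); }

    private synchronized void recordDuration(long ms) { failoverDurationsMs.add(ms); }

    /** Counts the attempt, then times it in try/finally so failures are measured too. */
    <T> T timedFailover(Callable<T> failoverAction) throws Exception {
        increment(SketchMetric.HA_FAILOVER_COUNT);
        long start = System.nanoTime();
        try {
            return failoverAction.call();
        } finally {
            recordDuration((System.nanoTime() - start) / 1_000_000L);
        }
    }

    /** Walks the cause chain; bumps the counter if a mutation-block exception is found. */
    boolean recordIfMutationBlocked(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SketchMutationBlockedException) {
                increment(SketchMetric.HA_MUTATION_BLOCKED_COUNT);
                return true;
            }
        }
        return false;
    }
}
```

Incrementing the counter before the try block means an attempt is counted even when it throws, which keeps the count and the number of duration samples in step.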

  Tier 2 (cross-cutting and server-side):

  - HA_CRR_REFRESH_COUNT — Counter, emitted at 
HighAvailabilityGroup.refreshClusterRoleRecord()
  - HA_CRR_CACHE_AGE_MS — Gauge, sampled at every connect()
  - HA_POLLER_TICK_COUNT — Counter, emitted in the 
GetClusterRoleRecordUtil.schedulePoller lambda
  - HA_POLLER_TICK_FAILURES — Counter, emitted at the same site, in the catch 
block
  - HA_BYPASSED_MUTATION_BLOCK_COUNT — Counter, emitted server-side at 
IndexRegionObserver.preBatchMutate when the _HAGroupName attribute is absent 
and the mutation proceeds
  (bypass-detection counter)
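
The client-side Tier 2 metrics above could be sketched roughly as follows: a gauge that reports how long ago the CRR cache was last refreshed, and a wrapper that counts poller ticks and tick failures. The class and method names (CrrCacheHealthSketch, onRefresh, instrumentPoller) and the injected clock are assumptions made for illustration, not existing Phoenix code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Hypothetical holder for CRR cache-health metrics. */
class CrrCacheHealthSketch {
    private final LongSupplier clockMs;       // injected so tests can control time
    private final AtomicLong lastRefreshMs;
    final LongAdder refreshCount = new LongAdder();
    final LongAdder pollerTicks = new LongAdder();
    final LongAdder pollerTickFailures = new LongAdder();

    CrrCacheHealthSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
        this.lastRefreshMs = new AtomicLong(clockMs.getAsLong());
    }

    /** Call wherever the cached record is refreshed. */
    void onRefresh() {
        refreshCount.increment();
        lastRefreshMs.set(clockMs.getAsLong());
    }

    /** Gauge value: milliseconds since the last refresh, sampled on demand. */
    long cacheAgeMs() {
        return clockMs.getAsLong() - lastRefreshMs.get();
    }

    /** Wraps one poll iteration so every tick and every failed tick is counted. */
    Runnable instrumentPoller(Runnable pollOnce) {
        return () -> {
            pollerTicks.increment();
            try {
                pollOnce.run();
            } catch (RuntimeException e) {
                // Count and swallow so one bad tick does not end the schedule.
                pollerTickFailures.increment();
            }
        };
    }
}
```

If the poller runs on a ScheduledExecutorService, catching the exception inside the task matters: a periodic task that throws has its later executions suppressed, so an uncounted failure would also silently stop the poller.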

  Tier 1 lands first as a self-contained client-side change. Tier 2 builds on
Tier 1 and adds the server-side counter, which needs a shared
IndexRegionObserverMetrics JMX surface.
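
For the server-side bypass-detection counter in Tier 2, the check itself is small. The sketch below models a mutation's attributes as a plain Map, which is an assumption made to keep the example free of HBase dependencies; BypassDetectionSketch and its methods are hypothetical names, and only the _HAGroupName attribute name comes from the description above.

```java
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical bypass detector: counts mutations that arrive without the HA group attribute. */
class BypassDetectionSketch {
    // Attribute name as given in the ticket text.
    static final String HA_GROUP_ATTR = "_HAGroupName";

    private final LongAdder bypassed = new LongAdder();

    /** Returns true and counts when the attribute is absent, i.e. the mutation skips the block check. */
    boolean observe(Map<String, byte[]> mutationAttributes) {
        if (mutationAttributes.get(HA_GROUP_ATTR) == null) {
            bypassed.increment();
            return true;
        }
        return false;
    }

    long bypassedCount() {
        return bypassed.sum();
    }
}
```

The counter only observes; it does not reject the mutation, which matches the "mutation proceeds" wording of the metric.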

  was:
The HA client today emits metrics only for the PARALLEL policy 
(HA_PARALLEL_COUNT_*) and HA executor pool task counters (taskRejectedCounter, 
taskExecutedCounter, taskEndToEndCounter). There is no client-side 
observability for the FAILOVER policy's transitions, the mutation-block path, 
or the CRR cache health. This makes it hard for operators to detect failover 
events, measure failover duration, alert on mutation-block hit rate, or 
diagnose CRR cache staleness without scraping logs.

 

Proposed metrics (split into two tiers):

  Tier 1 — client-side counters mirroring the existing 
PhoenixHAGroupMetrics.HAMetricType enum pattern:

  ┌─────────────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────┐
  │           Metric            │   Type    │                           Emission point                            │
  ├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
  │ HA_FAILOVER_COUNT           │ Counter   │ FailoverPhoenixConnection.failover()                                │
  ├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
  │ HA_FAILOVER_DURATION_MS     │ Histogram │ Same site, around try/finally                                       │
  ├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
  │ HA_MUTATION_BLOCKED_COUNT   │ Counter   │ MutationState.send catch site for MutationBlockedIOException causes │
  ├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
  │ HA_STALE_CRR_DETECTED_COUNT │ Counter   │ FailoverPhoenixConnection.wrapActionDuringFailover SCRE catch site  │
  └─────────────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────┘

  Tier 2 — cross-cutting + server-side:

  ┌──────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │              Metric              │  Type   │                                               Emission point                                               │
  ├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ HA_CRR_REFRESH_COUNT             │ Counter │ HighAvailabilityGroup.refreshClusterRoleRecord()                                                           │
  ├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ HA_CRR_CACHE_AGE_MS              │ Gauge   │ Sample at every connect()                                                                                  │
  ├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ HA_POLLER_TICK_COUNT             │ Counter │ GetClusterRoleRecordUtil.schedulePoller lambda                                                             │
  ├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ HA_POLLER_TICK_FAILURES          │ Counter │ Same site, catch block                                                                                     │
  ├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ HA_BYPASSED_MUTATION_BLOCK_COUNT │ Counter │ Server-side IndexRegionObserver.preBatchMutate when _HAGroupName attribute is absent and mutation proceeds │
  └──────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  Tier-1 lands first as a self-contained client-side change. Tier-2 stacks on 
Tier-1 and includes the server-side counter, which needs a shared 
IndexRegionObserverMetrics JMX surface.


> Add client-side metrics for HA failover, mutation-block, and CRR cache health
> -----------------------------------------------------------------------------
>
>                 Key: PHOENIX-7872
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7872
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Lokesh Khurana
>            Assignee: Lokesh Khurana
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
