[
https://issues.apache.org/jira/browse/PHOENIX-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lokesh Khurana updated PHOENIX-7872:
------------------------------------
Description:
The HA client today emits metrics only for the PARALLEL policy
(HA_PARALLEL_COUNT_*) and HA executor pool task counters (taskRejectedCounter,
taskExecutedCounter, taskEndToEndCounter). There is no client-side
observability for the FAILOVER policy's transitions, the mutation-block path,
or the CRR cache health. This makes it hard for operators to detect failover
events, measure failover duration, alert on mutation-block hit rate, or
diagnose CRR cache staleness without scraping logs.
Proposed metrics, split into two tiers.
Tier 1 — client-side counters mirroring the existing
PhoenixHAGroupMetrics.HAMetricType enum pattern:
- HA_FAILOVER_COUNT — Counter, emitted at FailoverPhoenixConnection.failover()
- HA_FAILOVER_DURATION_MS — Histogram, emitted at the same site, around the
failover try/finally
- HA_MUTATION_BLOCKED_COUNT — Counter, emitted at MutationState.send catch
site for MutationBlockedIOException causes
- HA_STALE_CRR_DETECTED_COUNT — Counter, emitted at
FailoverPhoenixConnection.wrapActionDuringFailover SCRE catch site
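As a rough illustration of the Tier 1 emission points, the sketch below counts each failover attempt, records its duration in a finally block so that failed attempts are measured too, and walks an exception's cause chain to count mutation-block hits. Everything in it is an assumption made for illustration: HaTier1MetricsSketch, timedFailover, recordIfMutationBlocked, and BlockedStandInException are hypothetical names, plain LongAdder fields stand in for whatever PhoenixHAGroupMetrics actually uses, and the list of samples stands in for a real histogram.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative stand-in only; not Phoenix's PhoenixHAGroupMetrics. */
final class HaTier1MetricsSketch {
    /** Hypothetical placeholder for the real MutationBlockedIOException. */
    static final class BlockedStandInException extends IOException {
        BlockedStandInException(String msg) { super(msg); }
    }

    final LongAdder failoverCount = new LongAdder();         // HA_FAILOVER_COUNT
    final LongAdder mutationBlockedCount = new LongAdder();  // HA_MUTATION_BLOCKED_COUNT
    // HA_FAILOVER_DURATION_MS: raw samples; a real histogram would bucket them.
    final List<Long> failoverDurationsMs = new ArrayList<>();

    /** Counts the attempt, then records elapsed time even if the action throws. */
    <T> T timedFailover(Callable<T> failoverAction) throws Exception {
        failoverCount.increment();
        long startNanos = System.nanoTime();
        try {
            return failoverAction.call();
        } finally {
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
            synchronized (failoverDurationsMs) {
                failoverDurationsMs.add(elapsedMs);
            }
        }
    }

    /** Walks the cause chain and bumps the counter if a blocked cause is found. */
    boolean recordIfMutationBlocked(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof BlockedStandInException) {
                mutationBlockedCount.increment();
                return true;
            }
        }
        return false;
    }
}
```

Recording the duration in finally means the histogram reflects failed failovers as well as successful ones, which is usually what an operator alerting on failover latency wants.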
Tier 2 — cross-cutting + server-side:
- HA_CRR_REFRESH_COUNT — Counter, emitted at
HighAvailabilityGroup.refreshClusterRoleRecord()
- HA_CRR_CACHE_AGE_MS — Gauge, sampled at every connect()
- HA_POLLER_TICK_COUNT — Counter, emitted in the
GetClusterRoleRecordUtil.schedulePoller lambda
- HA_POLLER_TICK_FAILURES — Counter, emitted at the same site, in the catch
block
- HA_BYPASSED_MUTATION_BLOCK_COUNT — Counter, emitted server-side at
IndexRegionObserver.preBatchMutate when the _HAGroupName attribute is absent
and the mutation proceeds
(bypass-detection counter)
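The client-side Tier 2 metrics could be wired roughly as follows. This is a minimal sketch under stated assumptions, not the proposed implementation: CrrCacheMetricsSketch, markRefreshed, cacheAgeMs, and instrument are hypothetical names, the clock is injected so the gauge can be exercised deterministically, and the sketch assumes a successful poller tick also refreshes the cached record.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Illustrative stand-in only; names are not from the Phoenix code base. */
final class CrrCacheMetricsSketch {
    final LongAdder refreshCount = new LongAdder();        // HA_CRR_REFRESH_COUNT
    final LongAdder pollerTickCount = new LongAdder();     // HA_POLLER_TICK_COUNT
    final LongAdder pollerTickFailures = new LongAdder();  // HA_POLLER_TICK_FAILURES
    private final AtomicLong lastRefreshMs = new AtomicLong();
    private final LongSupplier clockMs;

    CrrCacheMetricsSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
        lastRefreshMs.set(clockMs.getAsLong());
    }

    /** Call wherever the cached cluster role record is replaced. */
    void markRefreshed() {
        refreshCount.increment();
        lastRefreshMs.set(clockMs.getAsLong());
    }

    /** HA_CRR_CACHE_AGE_MS: value a gauge would report when sampled at connect(). */
    long cacheAgeMs() {
        return clockMs.getAsLong() - lastRefreshMs.get();
    }

    /** Wraps a poller body so every tick is counted and failures are recorded. */
    Runnable instrument(Runnable fetchRecord) {
        return () -> {
            pollerTickCount.increment();
            try {
                fetchRecord.run();
                markRefreshed();
            } catch (RuntimeException e) {
                // Swallowed on purpose: a task that throws out of
                // scheduleAtFixedRate is never scheduled again.
                pollerTickFailures.increment();
            }
        };
    }
}
```

The wrapped Runnable can be handed to a ScheduledExecutorService, for example via scheduleAtFixedRate(metrics.instrument(body), 0, 5, TimeUnit.SECONDS), where the five-second period is an arbitrary illustrative value.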
Tier-1 lands first as a self-contained client-side change. Tier-2 stacks on
Tier-1 and includes the server-side counter, which needs a shared
IndexRegionObserverMetrics JMX surface.
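For the server-side counter mentioned above, the bypass check reduces to testing whether the _HAGroupName attribute is present on an incoming mutation. The sketch below models mutation attributes as a plain Map so that it runs without HBase on the classpath; BypassDetectionSketch and hasHaGroupAttribute are hypothetical names, and how the real counter would be exposed through IndexRegionObserverMetrics is not shown.

```java
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative stand-in only; the real check would live in preBatchMutate. */
final class BypassDetectionSketch {
    // Attribute name taken from the issue description.
    static final String HA_GROUP_ATTR = "_HAGroupName";

    final LongAdder bypassedMutationBlockCount = new LongAdder(); // HA_BYPASSED_MUTATION_BLOCK_COUNT

    /**
     * Returns true when the attribute is present. When it is absent, the
     * counter is incremented and the caller lets the mutation proceed.
     */
    boolean hasHaGroupAttribute(Map<String, byte[]> mutationAttributes) {
        if (mutationAttributes.get(HA_GROUP_ATTR) == null) {
            bypassedMutationBlockCount.increment();
            return false;
        }
        return true;
    }
}
```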
was:
The HA client today emits metrics only for the PARALLEL policy
(HA_PARALLEL_COUNT_*) and HA executor pool task counters (taskRejectedCounter,
taskExecutedCounter, taskEndToEndCounter). There is no client-side
observability for the FAILOVER policy's transitions, the mutation-block path,
or the CRR cache health. This makes it hard for operators to detect failover
events, measure failover duration, alert on mutation-block hit rate, or
diagnose CRR cache staleness without scraping logs.
Proposed metrics (split into two tiers):
Tier 1 — client-side counters mirroring the existing
PhoenixHAGroupMetrics.HAMetricType enum pattern:
┌─────────────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────┐
│ Metric                      │ Type      │ Emission point                                                      │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_FAILOVER_COUNT           │ Counter   │ FailoverPhoenixConnection.failover()                                │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_FAILOVER_DURATION_MS     │ Histogram │ Same site, around try/finally                                       │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_MUTATION_BLOCKED_COUNT   │ Counter   │ MutationState.send catch site for MutationBlockedIOException causes │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_STALE_CRR_DETECTED_COUNT │ Counter   │ FailoverPhoenixConnection.wrapActionDuringFailover SCRE catch site  │
└─────────────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────┘
Tier 2 — cross-cutting + server-side:
┌──────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                           │ Type    │ Emission point                                                                                             │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_CRR_REFRESH_COUNT             │ Counter │ HighAvailabilityGroup.refreshClusterRoleRecord()                                                           │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_CRR_CACHE_AGE_MS              │ Gauge   │ Sample at every connect()                                                                                  │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_POLLER_TICK_COUNT             │ Counter │ GetClusterRoleRecordUtil.schedulePoller lambda                                                             │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_POLLER_TICK_FAILURES          │ Counter │ Same site, catch block                                                                                     │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_BYPASSED_MUTATION_BLOCK_COUNT │ Counter │ Server-side IndexRegionObserver.preBatchMutate when _HAGroupName attribute is absent and mutation proceeds │
└──────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Tier-1 lands first as a self-contained client-side change. Tier-2 stacks on
Tier-1 and includes the server-side counter, which needs a shared
IndexRegionObserverMetrics JMX surface.
> Add client-side metrics for HA failover, mutation-block, and CRR cache health
> -----------------------------------------------------------------------------
>
> Key: PHOENIX-7872
> URL: https://issues.apache.org/jira/browse/PHOENIX-7872
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Lokesh Khurana
> Assignee: Lokesh Khurana
> Priority: Major
>