[ 
https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harry Zhang updated HELIX-753:
------------------------------
    Summary: Record top state handoff finished in single cluster data cache 
refresh  (was: record top state handoff finished in single cluster data cache 
refresh)

> Record top state handoff finished in single cluster data cache refresh
> ----------------------------------------------------------------------
>
>                 Key: HELIX-753
>                 URL: https://issues.apache.org/jira/browse/HELIX-753
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Harry Zhang
>            Assignee: Harry Zhang
>            Priority: Major
>
> Currently we calculate top state handoff duration as follows:
>  - record the missing top state when we see that a top state is missing
>  - record the top state's return when we see it come back
>  - report the top state handoff duration
> This is fine for non-P2P state transitions, since the entire top state 
> handoff always takes at least 2 pipeline runs. For P2P-enabled clusters, 
> however, top state handoffs are quick. If a handoff completes faster than 
> the cluster data refresh stage latency, it is never observed, so we lose 
> many short top state handoffs, which distorts the numbers on ingraph.
> We need to revise the top state handoff metrics implementation so that we 
> don't lose data points (i.e. we are currently losing all short handoffs).
> AC:
>  - revise the implementation so that we catch those short top state handoffs
>  - write new tests to cover the fix if needed
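
For illustration only, the three recording steps and the missed short-handoff case described in the issue might be sketched as below. This is not the Helix implementation: the class `TopStateHandoffTracker`, its method names, and the assumption that the old owner's end timestamp and the new owner's start timestamp are available at refresh time are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical tracker for top state handoff durations; not Helix code. */
class TopStateHandoffTracker {
    // partition name -> timestamp (ms) at which the top state was first seen missing
    private final Map<String, Long> missingSince = new HashMap<>();

    /** A refresh observed a partition without a top-state replica. */
    void recordMissing(String partition, long nowMs) {
        // Keep the earliest observation; later refreshes must not reset the start.
        missingSince.putIfAbsent(partition, nowMs);
    }

    /** A refresh observed the top state again; returns the duration if it was tracked. */
    OptionalLong recordRecovered(String partition, long nowMs) {
        Long start = missingSince.remove(partition);
        return start == null ? OptionalLong.empty() : OptionalLong.of(nowMs - start);
    }

    /**
     * Handoff that started and finished between two refreshes: the top-state
     * owner changed, but no refresh ever saw the top state missing.
     * Assumes (hypothetically) that both timestamps can be read from the
     * refreshed cluster data.
     */
    OptionalLong recordSingleRefreshHandoff(String partition,
                                            String prevOwner,
                                            String currOwner,
                                            long oldOwnerEndMs,
                                            long newOwnerStartMs) {
        if (prevOwner == null || currOwner == null || prevOwner.equals(currOwner)) {
            return OptionalLong.empty(); // no ownership change, nothing to report
        }
        if (missingSince.containsKey(partition)) {
            return OptionalLong.empty(); // already tracked by the regular path
        }
        // Clamp at zero in case the two timestamps come from skewed clocks.
        return OptionalLong.of(Math.max(0L, newOwnerStartMs - oldOwnerEndMs));
    }
}
```

With only the first two methods, a handoff that completes inside one refresh interval never triggers `recordMissing` and so produces no data point; the third method is one possible way to capture it by comparing the previous and current top-state owners.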



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
