[ https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harry Zhang updated HELIX-753:
------------------------------
    Summary: Record top state handoff finished in single cluster data cache refresh  (was: record top state handoff finished in single cluster data cache refresh)

> Record top state handoff finished in single cluster data cache refresh
> ----------------------------------------------------------------------
>
>                 Key: HELIX-753
>                 URL: https://issues.apache.org/jira/browse/HELIX-753
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Harry Zhang
>            Assignee: Harry Zhang
>            Priority: Major
>
> Currently we calculate top state handoff duration as follows:
> - record the missing top state when we see a top state go missing
> - record the top state coming back when we see it come back
> - report the top state handoff duration
> This is perfectly fine for non-P2P state transitions, because the entire top
> state handoff process always takes >= 2 pipeline runs to finish. In P2P
> enabled clusters, however, top state handoffs are quick. If a handoff finishes
> faster than the cluster data refresh stage latency, it starts and ends within
> a single cluster data cache refresh and we never see the top state missing.
> We therefore lose a lot of short top state handoffs, which makes the number
> look miserable on ingraph.
> We need to revise the top state handoff metrics implementation so that we
> don't lose data points systematically (i.e. we are losing all short handoffs
> now).
> AC:
> - revise the implementation so we catch those short top state handoffs
> - write new tests to cover the fix if needed

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
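To make the failure mode in the description concrete, here is a minimal simulation sketch. It is not Helix code: `Handoff`, `top_state_present`, `observed_durations`, the 500 ms refresh interval, and the sample timestamps are all hypothetical names and values chosen for illustration. It assumes the controller only sees whether the top state is present at each cluster data refresh, and measures a handoff as the time between the refresh that first sees the top state missing and the refresh that first sees it back.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """A hypothetical top state handoff: the top state is absent in [start_ms, end_ms)."""
    start_ms: int
    end_ms: int


def top_state_present(handoffs, t_ms):
    """Return True if no handoff is in progress at time t_ms."""
    return not any(h.start_ms <= t_ms < h.end_ms for h in handoffs)


def observed_durations(handoffs, refresh_times_ms):
    """Durations a refresh-sampled observer would report (illustrative only)."""
    durations = []
    missing_since = None
    for t in refresh_times_ms:
        present = top_state_present(handoffs, t)
        if not present and missing_since is None:
            # First refresh that sees the top state missing.
            missing_since = t
        elif present and missing_since is not None:
            # First refresh that sees the top state back.
            durations.append(t - missing_since)
            missing_since = None
    return durations


# One short handoff (30 ms, P2P-like) and one long handoff (1350 ms).
handoffs = [Handoff(120, 150), Handoff(950, 2300)]
# Assumed refresh every 500 ms.
refreshes = list(range(0, 4000, 500))

print(observed_durations(handoffs, refreshes))
```

Under these assumptions only the long handoff is reported. The 30 ms handoff starts and ends between the refreshes at 0 ms and 500 ms, so no data point is recorded for it at all, which is the loss of short handoffs the issue describes.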