I had a comment similar to Becket's but accidentally posted it on the vote thread last Friday. From that thread: "I noticed that both the ControllerState metric and the *RateAndTimeMs metrics only cover a subset of the controller event types. Was this intentional?"
I think it makes most sense to just have the ControllerState metric and the *RateAndTimeMs metrics exactly match up with the event types.

On Mon, May 8, 2017 at 4:35 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Becket,
>
> Thanks for the feedback. Comments inline.
>
> On Tue, May 9, 2017 at 12:19 AM, Becket Qin <becket....@gmail.com> wrote:
>
>> q10. With the event-loop-based controller design, it seems natural to have
>> the processing time for each controller event type. In that case, the
>> current metrics don't seem to cover all the event processing times, e.g.
>> broker-join and broker-failure event handling time.
>
> I'll leave it to Jun to explain why some metrics were not included in the
> proposal.
>
>> q11. There are actually a couple of stages for controller state transition
>> and propagation. More specifically:
>> Stage 1: Event queue time
>> Stage 2: Event process time (including all the ZK path updates)
>> Stage 3: State propagation time (including the state propagation queuing
>> time on the controller and the time from request send to response receipt)
>>
>> I think it's worth having metrics for each of these stages. Stage 3 might
>> be a little tricky because we may need to manage the per-broker propagation
>> time. Arguably, the state propagation time overlaps somewhat with the
>> broker-side request handling time, but for a state change that involves
>> multiple types of requests, it could still be useful to know how long it
>> takes for a state transition to be propagated to the entire cluster.
>
> Can you please elaborate on how you would like to see this measured? Do
> you mean that each event would have a separate metric for each of these
> stages? Maybe a worked-out example would make it clear.
>
>> q12. Regarding (13), the controller actually does not update the ISR but
>> just reads it. The error message seems to come from the brokers. It
>> usually happens in the following case:
>> 1. The current leader broker has the cached ZNode version V.
>> 2. The controller elects a new leader and updates the ZNode, so the ZNode
>> version becomes V+1.
>> 3. The controller sends the LeaderAndIsrRequest to the replica brokers to
>> propagate the new leader information as well as the new zkVersion.
>> 4. Before the old leader processes the LeaderAndIsrRequest from the
>> controller in step 3, it tries to update the ISR and finds that the cached
>> zkVersion V is different from the actual zkVersion V+1.
>>
>> It looks like this is not a controller metric.
>
> Yes, it's not a controller metric, which is why it's under the "Partition
> Metrics" section. In the PR, I actually implemented it in ReplicaManager
> alongside IsrExpandsPerSec and IsrShrinksPerSec.
>
> Thanks,
> Ismael
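
To make the point about matching metrics to event types concrete, here is a minimal sketch (not from the KIP or the PR; the event, class, and metric names below are made up for illustration) of how stage 1 (queue time) and stage 2 (process time) could be recorded per controller event type around a queue-based event loop:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical event types; the real controller defines its own set of events.
sealed trait ControllerEvent {
  val enqueueTimeNs: Long = System.nanoTime()
}
case class BrokerChange(brokerIds: Set[Int]) extends ControllerEvent
case class TopicChange(topics: Set[String]) extends ControllerEvent

// Stand-in for per-event-type timers (in Kafka these would be Yammer timers,
// e.g. one <EventType>QueueTimeMs and one <EventType>RateAndTimeMs per event).
class EventTimings {
  def recordQueueTimeMs(eventName: String, ms: Long): Unit =
    println(s"${eventName}QueueTimeMs: $ms")
  def recordProcessTimeMs(eventName: String, ms: Long): Unit =
    println(s"${eventName}RateAndTimeMs: $ms")
}

class ControllerEventLoop(timings: EventTimings) extends Runnable {
  private val queue = new LinkedBlockingQueue[ControllerEvent]()

  def put(event: ControllerEvent): Unit = queue.put(event)

  override def run(): Unit = {
    try {
      while (true) {
        val event = queue.take()
        val eventName = event.getClass.getSimpleName
        val dequeueNs = System.nanoTime()
        // Stage 1: time the event spent waiting in the queue.
        timings.recordQueueTimeMs(
          eventName, TimeUnit.NANOSECONDS.toMillis(dequeueNs - event.enqueueTimeNs))
        process(event)
        // Stage 2: time spent processing the event (ZK reads/writes happen inside).
        timings.recordProcessTimeMs(
          eventName, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - dequeueNs))
      }
    } catch {
      case _: InterruptedException => // shutdown
    }
  }

  private def process(event: ControllerEvent): Unit = {
    // Real processing (leader election, ZK updates, request batching) goes here.
  }
}

Stage 3 (propagation time) would presumably need a send timestamp attached to each outgoing request and recorded when the corresponding broker response arrives, which is why the per-broker bookkeeping Becket mentions is trickier than the two stages shown above.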