Hi Becket, Thanks for the feedback. Comments inline.
On Tue, May 9, 2017 at 12:19 AM, Becket Qin <becket....@gmail.com> wrote: > > q10. With event loop based controller design, it seems natrural to have the > processing time for each controller event type. In that case, the current > metrics seem not covering all the event processing time? e.g. (broker > joining and broker failure event handling time). > I'll leave it to Jun to explain why some metrics were not included in the proposal. q11. There are actually a couple of stages for controller state transition > and propagation. More specifically: > Stage 1: Event queue time > Stage 2: Event process time (including all the zk path updates) > Stage 3: State propagation time (including the state propagation queuing > time on the controller and the actual request sent to response receive > time) > > I think it worth to have metrics for each of the stage. Stage 3 might be a > little tricky because we may need to potentially manage the per broker > propagation time. Arguably the state propagation time is a little > overlapping with the broker side request handling time, but for some state > change that involves multiple types of requests, it could still be useful > to know what is the time for a state transition to be propagated to the > entire cluster. > Can you please elaborate on how you would like to see this measured. Do you mean that each event would have a separate metric for each of these stages? Maybe a worked out example would make it clear. q11. Regarding (13), controller actually do not update the ISR but just > read it. The error message seems from the brokers. It usually happens in > the following case: > 1. Current leader broker has the cached ZNode version V > 2. The controller elected a new leader and updated the ZNode, now ZNode > version becomes V+1, > 3. Controller sends the LeaderAndIsrRequest to the replica brokers to > propagate the new leader information as well as the new zkVersion. > 4. Before the old leader process the LeaderAndIsrRequest from controller in > step 3, The old leader tries to update ISR and found the cached zkVersion V > is different from the actual zkVersion V+1. > > It looks that this is not a controller metric. Yes, it's not a controller metric, that's why it's under the "Partition Metrics" section. In the PR, I actually implemented it in ReplicaManager alongside IsrExpandsPerSec and IsrShrinksPerSec. Thanks, Ismael