Hey Kevin,

Thanks for the update.

The KIP suggests that AtMinIsr is a better indicator for alerting than
UnderReplicatedPartition. However, in most cases, where min_isr =
replica_set_size - 1, these two metrics are exactly the same, and planned
maintenance can easily cause a positive AtMinIsr value. In the other
scenario, say min_isr = 1 and replica_set_size = 3, it is still possible
that planned maintenance (e.g. one broker restart + partition
reassignment) causes the ISR size to drop to 1. Since AtMinIsr can also
produce false positives (i.e. AtMinIsr > 0 does not necessarily need
attention from the user), I am not sure it is worth adding this metric.
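
To make the comparison concrete, here is a minimal sketch of the two
conditions. It is not Kafka code: PartitionState and both helper functions
are hypothetical names, and it assumes UnderReplicatedPartition means
"ISR size < replica set size" and AtMinIsr means "ISR size == min_isr".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition (not a Kafka class)."""
    replicas: int  # replica_set_size
    isr: int       # current number of in-sync replicas
    min_isr: int   # min.insync.replicas


def under_replicated(p: PartitionState) -> bool:
    # Assumed definition: fewer in-sync replicas than assigned replicas.
    return p.isr < p.replicas


def at_min_isr(p: PartitionState) -> bool:
    # Assumed definition: ISR holds exactly the configured minimum.
    return p.isr == p.min_isr


if __name__ == "__main__":
    # Walk both scenarios from the mail with replica_set_size = 3.
    for min_isr in (2, 1):
        for isr in (3, 2, 1):
            p = PartitionState(replicas=3, isr=isr, min_isr=min_isr)
            print(f"min_isr={min_isr} isr={isr} "
                  f"under_replicated={under_replicated(p)} "
                  f"at_min_isr={at_min_isr(p)}")
```

Under these assumptions, with min_isr = 2 a single broker restart (ISR 3 ->
2) turns both conditions on at the same time. With min_isr = 1, AtMinIsr
stays off at ISR = 2 and only turns on once the ISR drops to 1.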

In the Usage section, it is mentioned that the user needs to manually check
whether there is ongoing maintenance after AtMinIsr is triggered. Could you
explain how this is different from the current approach, where we use
UnderReplicatedPartition to trigger the alert? More specifically, can we just
replace AtMinIsr with UnderReplicatedPartition in the Usage section?

Thanks,
Dong


On Tue, Feb 26, 2019 at 6:49 PM Kevin Lu <lu.ke...@berkeley.edu> wrote:

> Hi Dong!
>
> Thanks for the feedback!
>
> You bring up a good point in that the AtMinIsr metric cannot be used to
> identify failure in the mentioned scenarios. I admit the motivation section
> placed too much emphasis on "identifying failure".
>
> I have modified the KIP to reflect the implementation: the AtMinIsr
> metric is intended to serve as a warning, because one more failure on a
> partition that is AtMinIsr will cause producers configured with acks=ALL
> to fail. It has an additional benefit when minIsr=1, as it warns us that
> the entire partition is at risk of going offline, but that is more of a
> side effect that only applies in that scenario (minIsr=1).
>
> Regards,
> Kevin
>
> On Tue, Feb 26, 2019 at 5:11 PM Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Kevin,
> >
> > Thanks for the proposal!
> >
> > It seems that the proposed implementation does not match the motivation.
> > The motivation suggests that the operator wants to tell planned
> > maintenance (e.g. broker restart) from unplanned failure (e.g. network
> > failure). But the AtMinIsr metric does not really differentiate
> > between these causes of a reduced ISR. For example, an unplanned
> > failure can cause the ISR to drop from 3 to 2 while it is still
> > higher than minIsr (say 1). And planned maintenance can cause the ISR
> > to drop from 3 to 2, which triggers the AtMinIsr metric if minIsr=2.
> > Can you update the design doc to fix or explain this issue?
> >
> > Thanks,
> > Dong
> >
> > On Tue, Feb 12, 2019 at 9:02 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:
> >
> > > Hi All,
> > >
> > > Getting the discussion thread started for KIP-427 in case anyone is
> > > free right now.
> > >
> > > I’d like to propose a new category of topic partitions, *AtMinIsr*,
> > > which are partitions that only have the minimum number of in-sync
> > > replicas left in the ISR set (as configured by min.insync.replicas).
> > >
> > > This would add two new metrics, *ReplicaManager.AtMinIsrPartitionCount*
> > > & *Partition.AtMinIsr*, and a new TopicCommand option,
> > > *--at-min-isr-partitions*, to help in monitoring and alerting.
> > >
> > > KIP link: KIP-427: Add AtMinIsr topic partition category (new metric &
> > > TopicCommand option)
> > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103089398>
> > >
> > > Please take a look and let me know what you think.
> > >
> > > Regards,
> > > Kevin
> > >
> >
>
