Re: [DISCUSS] KIP-1291: Extended I/O Metrics Collection

Manan Gupta Thu, 12 Mar 2026 21:55:19 -0700

Hey Sahil,

I agree with your points. The proposal looks good to me.
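
For anyone reading the archive later, the poll-style read discussed below can
be sketched roughly as follows. This is only an illustration, not code from
the KIP or from Kafka: the class name ProcIoSketch, its method names, and the
-1 sentinel for "unavailable" are all assumptions made for this sketch, and it
uses a plain java.util.function.Supplier in place of a Yammer Gauge so that it
stays self-contained. The ratePerSecond helper shows how a monitoring system
could still derive a rate from two cumulative samples, treating a decrease as
a process restart.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; names and the -1 sentinel are assumptions, not Kafka APIs.
public class ProcIoSketch {

    // Parses "key: value" lines (the layout of /proc/self/io) into a map.
    // Lines without a colon or with a non-numeric value are skipped.
    public static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            String raw = line.substring(colon + 1).trim();
            try {
                values.put(key, Long.parseLong(raw));
            } catch (NumberFormatException e) {
                // Ignore malformed values rather than failing the whole read.
            }
        }
        return values;
    }

    // Poll-style read: every call re-reads the file and keeps no state,
    // which is the property the Gauge argument relies on.
    public static Supplier<Long> pollingGauge(Path procIo, String field) {
        return () -> {
            try {
                Long value = parse(Files.readAllLines(procIo)).get(field);
                return value == null ? -1L : value;
            } catch (IOException e) {
                return -1L; // file missing or unreadable (e.g. non-Linux host)
            }
        };
    }

    // Derives a per-second rate from two cumulative samples.
    // Returns -1.0 when the value went backwards (restart) or no time elapsed.
    public static double ratePerSecond(long previous, long current, long elapsedMillis) {
        if (elapsedMillis <= 0 || current < previous) {
            return -1.0;
        }
        return (current - previous) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        Supplier<Long> rchar = pollingGauge(Paths.get("/proc/self/io"), "rchar");
        System.out.println("rchar = " + rchar.get());
    }
}
```

A true Counter would instead have to remember the previous sample and push the
difference on every poll, which is the extra state Sahil refers to below.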


Thanks,
Manan

On Thu, Mar 12, 2026 at 5:43 PM Sahil Devgon <[email protected]>
wrote:

> Hi Manan,
>
> Thanks for the review.
>
> Regarding Point 1: Gauge vs Counter (metric type)
> You are correct that /proc/self/io values are monotonically increasing
> process-lifetime accumulators: they only go up, and they reset on restart.
> But here's why I think Gauge is still the right choice:
>
> 1. Yammer Metrics constraint: Kafka uses Yammer Metrics.
>    The Public Interfaces section of the KIP intentionally lists Type: Gauge
>    to reflect the actual JMX registration.
>    All seven metrics, including the two existing ones, are polled from
>    /proc/self/io.
>    Using Gauge with a poll-style read is consistent with how the original
>    two metrics were implemented.
>    Yammer Metrics' Counter type would require push-style .inc(n)
>    increments, which means a true counter would require manually tracking
>    deltas between polls, introducing statefulness and drift risk.
>
> 2. Consistency: the 2 pre-existing disk metrics (read_bytes, write_bytes)
>    are Gauges.
>    Changing the 5 new metrics to Counter while leaving those 2 as Gauge
>    would create an inconsistent interface. If the type is to change, it
>    would have to be a coordinated change across all 7, which is a larger,
>    breaking discussion that falls outside the purely additive scope of
>    this KIP.
>
> Regarding Point 2: Process-level vs per-disk granularity
>
> The "Metric Behaviour During Operations" section of the KIP already notes
> "Metrics reflect total process I/O per broker,
> regardless of partition ownership". Further, I have strengthened this by
> adding an explicit callout in the Public Interfaces
> section that the linux-disk-* prefix is a historical naming convention
> inherited from the two existing metrics, and that
> all seven metrics reflect aggregate broker JVM process I/O across all
> configured log.dirs, not per-disk activity.
>
> Best,
> Sahil Devgon
>
> On Wed, Mar 11, 2026 at 3:23 PM Manan Gupta <[email protected]> wrote:
>
> > Thanks for the detailed proposal — this looks useful for improving I/O
> > observability.
> >
> >
> > Regarding the metric type: the proposal describes these metrics as
> > *monotonically increasing counters*, but they are exposed as *Gauge*
> > metrics. Since /proc/self/io values only increase during the process
> > lifetime and reset on restart, would it make more sense to expose them
> > as counters instead? This might make rate calculations more
> > straightforward for monitoring systems that expect cumulative metrics
> > for deriving rates.
> >
> > Since /proc/self/io provides *process-level metrics*, these values will
> > aggregate I/O across all Kafka log directories. In deployments where
> > brokers use *multiple log.dirs across different disks*, operators often
> > rely on per-disk observability to diagnose hotspots or imbalanced usage.
> > Do you see any risk that these metrics could be misinterpreted as
> > disk-level signals, or should the documentation explicitly clarify that
> > they reflect *aggregate broker process I/O* rather than per-disk
> > activity?
> >
> > Thanks,
> > Manan
> >
> > On Mon, Mar 9, 2026 at 9:37 PM Sahil Devgon <[email protected]>
> > wrote:
> >
> > > Hi Team, just checking whether you have been able to review the KIP
> > > and have any comments or suggestions on this thread.
> > > https://cwiki.apache.org/confluence/x/co48G
> > >
> > > If there are no comments, I intend to start a Vote thread for it in
> > > the coming days.
> > >
> > > Thanks,
> > > Sahil Devgon
> > >
> > > On Wed, Feb 25, 2026 at 8:20 PM Sahil Devgon <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to start a discussion thread on KIP-1291. In this KIP,
> > > > we aim to expose all 7 Linux I/O metrics from /proc/self/io instead
> > > > of just the current 2 (read_bytes and write_bytes).
> > > > The 5 additional metrics (rchar, wchar, syscr, syscw,
> > > > cancelled_write_bytes) enable operators to diagnose cache
> > > > effectiveness, write amplification, and I/O pattern inefficiencies.
> > > >
> > > > https://cwiki.apache.org/confluence/x/co48G
> > > >
> > > > Please review the KIP and feel free to share your thoughts.
> > > >
> > > > Thanks,
> > > > Sahil Devgon
> > > >
> > > >
> > >
> >
>
