Hi Manan,

Thanks for the review.

Regarding Point 1: Gauge vs Counter (metric type)
You are correct that /proc/self/io values are monotonically increasing
process-lifetime accumulators: they only go up, and they reset on restart.
Here is why I think Gauge is still the right choice:

1. Yammer Metrics constraint: Kafka uses Yammer Metrics.
   The Public Interfaces section of the KIP intentionally lists Type: Gauge
   to reflect the actual JMX registration.
   All seven metrics, including the two existing ones, are polled from
   /proc/self/io.
   Using Gauge with a poll-style read is consistent with how the original
   two metrics were implemented.
   Yammer Metrics' Counter type is updated through push-style .inc(n)
   calls, so a true counter would require manually tracking deltas between
   polls, introducing statefulness and drift risk.

2. Consistency: the 2 pre-existing disk metrics (read_bytes, write_bytes)
   are Gauges.
   Changing the 5 new metrics to Counter while leaving those 2 as Gauge
   would create an inconsistent interface. If the type is to change, it
   would have to be a coordinated change across all 7, which is a larger,
   potentially breaking discussion and outside the scope of this purely
   additive KIP.
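
To illustrate the first point, here is a minimal sketch of the two read
styles. PolledGauge and DeltaFedCounter are hypothetical stand-ins written
for this email, not the Yammer Metrics classes or the code in the KIP; the
AtomicLong simply plays the role of one /proc/self/io field.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class GaugeVsCounterSketch {

    // Hypothetical poll-style gauge: holds no state of its own and
    // re-reads the source every time a reporter asks for the value.
    static final class PolledGauge {
        private final LongSupplier source;

        PolledGauge(LongSupplier source) {
            this.source = source;
        }

        long value() {
            return source.getAsLong();
        }
    }

    // Hypothetical push-style counter that is fed by increments.
    // To mirror an accumulator it must remember the last reading
    // and add only the difference on each poll.
    static final class DeltaFedCounter {
        private final LongSupplier source;
        private long lastSeen = 0L; // extra state needed for delta tracking
        private long count = 0L;

        DeltaFedCounter(LongSupplier source) {
            this.source = source;
        }

        void poll() {
            long current = source.getAsLong();
            long delta = current - lastSeen;
            if (delta < 0) {
                // Source went backwards (e.g. it was reset): treat the
                // whole current reading as new activity.
                delta = current;
            }
            count += delta; // equivalent of a push-style inc(delta)
            lastSeen = current;
        }

        long count() {
            return count;
        }
    }

    public static void main(String[] args) {
        AtomicLong source = new AtomicLong(0); // stands in for one /proc/self/io field
        PolledGauge gauge = new PolledGauge(source::get);
        DeltaFedCounter counter = new DeltaFedCounter(source::get);

        source.set(100);
        counter.poll();
        source.set(250); // more I/O happens before the next poll

        System.out.println("gauge=" + gauge.value() + " counter=" + counter.count());
    }
}
```

With the gauge, every read returns whatever the source reports at that
moment. With the delta-fed counter, the reported value is only as fresh as
the last poll() call, and lastSeen is extra state that has to stay correct
across resets.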

Regarding Point 2: Process-level vs per-disk granularity

The "Metric Behaviour During Operations" section of the KIP already notes
"Metrics reflect total process I/O per broker,
regardless of partition ownership". Further, I have strengthened this by
adding an explicit callout in the Public Interfaces
section that the linux-disk-* prefix is a historical naming convention
inherited from the two existing metrics, and that
all seven metrics reflect aggregate broker JVM process I/O across all
configured log.dirs, not per-disk activity.
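
As a rough illustration of why these values cannot be attributed to a
single disk: /proc/self/io carries one "name: value" line per field for the
whole process, with no device column. The parser below is a hypothetical
sketch (the class and method names are mine), not the collector code in
Kafka.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProcIoParseSketch {

    // Parses lines of the form "name: value" into an ordered map.
    // Lines without a colon or with a non-numeric value are skipped.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            try {
                fields.put(name, Long.parseLong(value));
            } catch (NumberFormatException e) {
                // Ignore malformed lines in this sketch.
            }
        }
        return fields;
    }
}
```

On a Linux host this could be fed with
Files.readAllLines(Paths.get("/proc/self/io")). The result has exactly one
entry per field (rchar, wchar, syscr, syscw, read_bytes, write_bytes,
cancelled_write_bytes), so any per-disk breakdown would have to come from a
different source.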

Best,
Sahil Devgon

On Wed, Mar 11, 2026 at 3:23 PM Manan Gupta <[email protected]> wrote:

> Thanks for the detailed proposal — this looks useful for improving I/O
> observability.
>
>
> Regarding the metric type: the proposal describes these metrics as
> *monotonically
> increasing counters*, but they are exposed as *Gauge* metrics. Since
> /proc/self/io values only increase during the process lifetime and reset on
> restart, would it make more sense to expose them as counters instead? This
> might make rate calculations more straightforward for monitoring systems
> that expect cumulative metrics for deriving rates.
>
> Since /proc/self/io provides *process-level metrics*, these values will
> aggregate I/O across all Kafka log directories. In deployments where brokers
> use *multiple log.dirs across different disks*, operators often rely on
> per-disk observability to diagnose hotspots or imbalanced usage. Do you see
> any risk that these metrics could be misinterpreted as disk-level signals,
> or should the documentation explicitly clarify that they reflect *aggregate
> broker process I/O* rather than per-disk activity?
>
> Thanks,
> Manan
>
> On Mon, Mar 9, 2026 at 9:37 PM Sahil Devgon <[email protected]>
> wrote:
>
> > Hi Team, just checking if you were able to review the KIP and have some
> > comments or suggestions on this thread.
> > https://cwiki.apache.org/confluence/x/co48G
> >
> > If there are no comments, I intend to start a Vote thread for the same in
> > the coming days.
> >
> > Thanks,
> > Sahil Devgon
> >
> > On Wed, Feb 25, 2026 at 8:20 PM Sahil Devgon <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I would like to start a discussion thread on KIP-1291. In this KIP, we aim
> > > to expose all 7 Linux I/O metrics from /proc/self/io instead of just the
> > > current 2 (read_bytes and write_bytes).
> > > The 5 additional metrics (rchar, wchar, syscr, syscw,
> > > cancelled_write_bytes) enable operators to diagnose cache effectiveness,
> > > write amplification, and I/O pattern inefficiencies.
> > >
> > > https://cwiki.apache.org/confluence/x/co48G
> > >
> > > Please review the KIP and feel free to share your thoughts.
> > >
> > > Thanks,
> > > Sahil Devgon
> > >
> > >
> >
>
