Hi Manan,

Thanks for the review.

Regarding Point 1: Gauge vs Counter (metric type)

You are correct that /proc/self/io values are monotonically increasing
process-lifetime accumulators: they only go up, and they reset on restart.
But here is why I think Gauge is still the right choice:

1. Yammer Metrics constraint: Kafka uses Yammer Metrics. The Public
   Interfaces section of the KIP intentionally lists Type: Gauge to reflect
   the actual JMX registration. All seven metrics, including the two
   existing ones, are polled from /proc/self/io. Using Gauge with a
   poll-style read is consistent with how the original two metrics were
   implemented. Yammer Metrics' Counter type would require push-style
   .inc(n) increments, which means a true counter would require manually
   tracking deltas between polls, introducing statefulness and drift risk.

2. Consistency: the 2 pre-existing disk metrics (read_bytes, write_bytes)
   are Gauges. Changing the 5 new metrics to Counter while leaving those 2
   as Gauge would create an inconsistent interface. If the type is to
   change, it would have to be a coordinated change across all 7, which is
   a larger, breaking discussion that goes beyond the purely additive scope
   of this KIP.

Regarding Point 2: Process-level vs per-disk granularity

The "Metric Behaviour During Operations" section of the KIP already notes
"Metrics reflect total process I/O per broker, regardless of partition
ownership". Further, I have strengthened this by adding an explicit callout
in the Public Interfaces section that the linux-disk-* prefix is a
historical naming convention inherited from the two existing metrics, and
that all seven metrics reflect aggregate broker JVM process I/O across all
configured log.dirs, not per-disk activity.

Best,
Sahil Devgon

On Wed, Mar 11, 2026 at 3:23 PM Manan Gupta <[email protected]> wrote:

> Thanks for the detailed proposal — this looks useful for improving I/O
> observability.
>
> Regarding the metric type: the proposal describes these metrics as
> *monotonically increasing counters*, but they are exposed as *Gauge*
> metrics. Since /proc/self/io values only increase during the process
> lifetime and reset on restart, would it make more sense to expose them
> as counters instead? This might make rate calculations more
> straightforward for monitoring systems that expect cumulative metrics
> for deriving rates.
>
> Since /proc/self/io provides *process-level metrics*, these values will
> aggregate I/O across all Kafka log directories. In deployments where
> brokers use *multiple log.dirs across different disks*, operators often
> rely on per-disk observability to diagnose hotspots or imbalanced usage.
> Do you see any risk that these metrics could be misinterpreted as
> disk-level signals, or should the documentation explicitly clarify that
> they reflect *aggregate broker process I/O* rather than per-disk
> activity?
>
> Thanks,
> Manan
>
> On Mon, Mar 9, 2026 at 9:37 PM Sahil Devgon <[email protected]> wrote:
>
> > Hi Team, just checking if you were able to review the KIP and have some
> > comments or suggestions on this thread.
> > https://cwiki.apache.org/confluence/x/co48G
> >
> > If there are no comments, I intend to start a Vote thread for the same
> > in the coming days.
> >
> > Thanks,
> > Sahil Devgon
> >
> > On Wed, Feb 25, 2026 at 8:20 PM Sahil Devgon <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I would like to start a discussion thread on KIP-1291. In this KIP, we
> > > aim to expose all 7 Linux I/O metrics from /proc/self/io instead of just
> > > the current 2 (read_bytes and write_bytes).
> > > The 5 additional metrics (rchar, wchar, syscr, syscw,
> > > cancelled_write_bytes) enable operators to diagnose cache effectiveness,
> > > write amplification, and I/O pattern inefficiencies.
> > >
> > > https://cwiki.apache.org/confluence/x/co48G
> > >
> > > Please review the KIP and feel free to share your thoughts.
> > >
> > > Thanks,
> > > Sahil Devgon
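
For readers following the Gauge vs Counter discussion, here is a minimal
sketch of what a poll-style read of /proc/self/io could look like. It is an
illustration only, not the actual Kafka code: the class name ProcIoSketch,
the methods parse and gaugeFor, and the -1 fallback value are all
assumptions made for this example, and java.util.function.LongSupplier
stands in for a Yammer Gauge so that the snippet has no external
dependencies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical illustration; names and the -1 fallback are assumptions.
class ProcIoSketch {

    // Parses "key: value" lines (the /proc/self/io layout) into a map.
    public static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> values = new HashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim();
            String raw = line.substring(colon + 1).trim();
            try {
                values.put(key, Long.parseLong(raw));
            } catch (NumberFormatException e) {
                // skip values that are not plain integers
            }
        }
        return values;
    }

    // Poll-style read: the file is re-read on every call and no state is
    // kept between polls, so there are no deltas to track.
    public static LongSupplier gaugeFor(Path procIo, String key) {
        return () -> {
            try {
                return parse(Files.readAllLines(procIo)).getOrDefault(key, -1L);
            } catch (IOException e) {
                return -1L; // assumed fallback when the file is unreadable
            }
        };
    }

    public static void main(String[] args) {
        // On non-Linux systems the file does not exist and this prints -1.
        LongSupplier syscr = gaugeFor(Paths.get("/proc/self/io"), "syscr");
        System.out.println("syscr = " + syscr.getAsLong());
    }
}
```

Because each poll returns the kernel's cumulative value directly, the
reported number is always whatever /proc/self/io says at that moment. A
push-style counter would instead have to remember the previous sample and
call .inc(n) with the difference, which is the extra state mentioned under
Point 1.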
