Hey Sahil,

I agree with your points. The proposal looks good to me.

Thanks,
Manan

On Thu, Mar 12, 2026 at 5:43 PM Sahil Devgon <[email protected]> wrote:

> Hi Manan,
>
> Thanks for the review.
>
> Regarding Point 1: Gauge vs Counter (metric type)
>
> You are correct that /proc/self/io values are monotonically increasing
> process-lifetime accumulators: they only go up, and they reset on restart.
> Here is why I think Gauge is still the right choice:
>
> 1. Yammer Metrics constraint: Kafka uses Yammer Metrics.
>    The Public Interfaces section of the KIP intentionally lists Type: Gauge
>    to reflect the actual JMX registration. All seven metrics, including the
>    two existing ones, are polled from /proc/self/io. Using Gauge with a
>    poll-style read is consistent with how the original two metrics were
>    implemented. Yammer Metrics' Counter type would require push-style
>    .inc(n) increments, so a true counter would require manually tracking
>    deltas between polls, introducing statefulness and drift risk.
>
> 2. Consistency: the 2 pre-existing disk metrics (read_bytes, write_bytes)
>    are Gauges. Changing the 5 new metrics to Counter while leaving those 2
>    as Gauge would create an inconsistent interface. If the type is to
>    change, it would have to be a coordinated change across all 7, which is
>    a larger, breaking-change discussion than fits the purely additive
>    nature of this KIP.
>
> Regarding Point 2: Process-level vs per-disk granularity
>
> The "Metric Behaviour During Operations" section of the KIP already notes
> "Metrics reflect total process I/O per broker, regardless of partition
> ownership". Further, I have strengthened this by adding an explicit callout
> in the Public Interfaces section that the linux-disk-* prefix is a
> historical naming convention inherited from the two existing metrics, and
> that all seven metrics reflect aggregate broker JVM process I/O across all
> configured log.dirs, not per-disk activity.
>
> Best,
> Sahil Devgon
>
> On Wed, Mar 11, 2026 at 3:23 PM Manan Gupta <[email protected]> wrote:
>
> > Thanks for the detailed proposal. This looks useful for improving I/O
> > observability.
> >
> > Regarding the metric type: the proposal describes these metrics as
> > *monotonically increasing counters*, but they are exposed as *Gauge*
> > metrics. Since /proc/self/io values only increase during the process
> > lifetime and reset on restart, would it make more sense to expose them as
> > counters instead? This might make rate calculations more straightforward
> > for monitoring systems that expect cumulative metrics for deriving rates.
> >
> > Since /proc/self/io provides *process-level metrics*, these values will
> > aggregate I/O across all Kafka log directories. In deployments where
> > brokers use *multiple log.dirs across different disks*, operators often
> > rely on per-disk observability to diagnose hotspots or imbalanced usage.
> > Do you see any risk that these metrics could be misinterpreted as
> > disk-level signals, or should the documentation explicitly clarify that
> > they reflect *aggregate broker process I/O* rather than per-disk activity?
> >
> > Thanks,
> > Manan
> >
> > On Mon, Mar 9, 2026 at 9:37 PM Sahil Devgon <[email protected]> wrote:
> >
> > > Hi Team, just checking if you were able to review the KIP and have some
> > > comments or suggestions on this thread.
> > > https://cwiki.apache.org/confluence/x/co48G
> > >
> > > If there are no comments, I intend to start a Vote thread for the same
> > > in the coming days.
> > >
> > > Thanks,
> > > Sahil Devgon
> > >
> > > On Wed, Feb 25, 2026 at 8:20 PM Sahil Devgon <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to start a discussion thread on KIP-1291. In this KIP, we
> > > > aim to expose all 7 Linux I/O metrics from /proc/self/io instead of
> > > > just the current 2 (read_bytes and write_bytes).
> > > > The 5 additional metrics (rchar, wchar, syscr, syscw,
> > > > cancelled_write_bytes) enable operators to diagnose cache
> > > > effectiveness, write amplification, and I/O pattern inefficiencies.
> > > >
> > > > https://cwiki.apache.org/confluence/x/co48G
> > > >
> > > > Please review the KIP and feel free to share your thoughts.
> > > >
> > > > Thanks,
> > > > Sahil Devgon
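
For anyone following the Gauge versus Counter discussion above, here is a rough sketch of what a poll-style read of /proc/self/io might look like, and how a monitoring system could derive a rate from two samples of a cumulative gauge value. This is not the KIP's or Kafka's actual implementation. The class name ProcIoSketch and both method names are hypothetical, chosen only for illustration. The only thing assumed from Linux is the "key: value" line format of /proc/self/io.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and not taken from Kafka. */
final class ProcIoSketch {
    private ProcIoSketch() {}

    /**
     * Parses lines in the "key: value" format used by /proc/self/io
     * (for example "rchar: 12345") into a map. Lines without a colon
     * or with a non-numeric value are skipped.
     */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim();
            String raw = line.substring(colon + 1).trim();
            try {
                values.put(key, Long.parseLong(raw));
            } catch (NumberFormatException e) {
                // Malformed value: skip it rather than fail the whole poll.
            }
        }
        return values;
    }

    /**
     * Derives a per-second rate from two samples of a cumulative value.
     * Returns NaN when the interval is not positive or when the value
     * went backwards, which would happen after a process restart.
     */
    static double ratePerSecond(long previous, long current, long elapsedMillis) {
        if (elapsedMillis <= 0 || current < previous) {
            return Double.NaN;
        }
        return (current - previous) * 1000.0 / elapsedMillis;
    }
}
```

A gauge in this style would call something like parse(Files.readAllLines(Path.of("/proc/self/io"))) each time it is read and return the requested field, so no state is kept between polls. The delta tracking shown in ratePerSecond then lives on the monitoring side, which is the statefulness the thread argues against pulling into the broker.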
