On 25.09.25 at 10:27, Fiona Ebner wrote:
> On 22.09.25 at 7:26 PM, Thomas Lamprecht wrote:
>> On 22.09.25 at 12:18, Fiona Ebner wrote:
>>> If disk read/write cannot be queried because of QMP timeout, they
>>> should not be reported as 0, but the last value should be re-used.
>>> Otherwise, the difference between that reported 0 and the next value,
>>> when the stats are queried successfully, will show up as a huge spike
>>> in the RRD graphs.
>>
>> Fine with the idea in general, but this is effectively only relevant
>> for pvestatd though?
>>
>> As of now we would also cache in the API daemon, without ever using
>> this. Might not be _that_ much, so not really a problem of the amount,
>> but it feels a bit wrong to me w.r.t. "code placement".
>>
>> Does pvestatd have the necessary info, directly or indirectly through
>> the existence of some other vmstatus properties, to derive when it can
>> safely reuse the previous value?
>
> It's safe (and sensible/required) if and only if there is no new value.
> We could have the cache live only inside pvestatd, initialize it with a
> value of 0, properly report diskread/write values as undef if we cannot
> get an actual value, and have undef mean "re-use the previous value".
> (Aside: we cannot use 0 instead of undef to mean "re-use the previous
> value", because there are edge cases where a later 0 actually means 0
> again, for example after all disks have been unplugged.)
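>
> A rough sketch of what I mean (hypothetical helper and names, not the
> actual vmstatus code, just to illustrate the undef-means-reuse
> semantics):
>
>     # pvestatd-only cache: vmid => { diskread => ..., diskwrite => ... }
>     my $last_disk_stats = {};
>
>     sub fill_disk_stats {
>         my ($vmid, $stats) = @_;
>         for my $key (qw(diskread diskwrite)) {
>             if (defined($stats->{$key})) {
>                 # got a fresh value from QMP, remember it
>                 $last_disk_stats->{$vmid}->{$key} = $stats->{$key};
>             } else {
>                 # undef signals a QMP timeout, so fall back to the
>                 # cached value (0 before the first real reading)
>                 $stats->{$key} = $last_disk_stats->{$vmid}->{$key} // 0;
>             }
>         }
>     }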
Yeah, it would have to be an invalid value like -1, but even that is
not ideal; an explicit undefined or null value would naturally be
better to signal what's happening.

>> Or maybe we could make this caching opt-in through some module flag
>> that only pvestatd sets? But I have not really thought that through,
>> so please take this with a grain of salt.
>>
>> btw. what about QMP being "stuck" for a prolonged time, should we
>> stop using the previous value after a few times (or some duration)?
>
> What other value could we use? Since the graph looks at the
> differences of reported values, the only reasonable value we can use
> if we cannot get a new one is the previous one, no matter how long it
> takes to get a new one; otherwise there will be that completely wrong
> spike again. Or is there an N/A kind of value that we could use, where
> RRD/graph would be smart enough to know "I cannot calculate a
> difference now, will have to wait for multiple good values"? Then I'd
> go for that instead of the current approach.

That should never be the problem of the metric-collecting entity, but
of the one interpreting or displaying the data, as otherwise this
creates a false impression of reality. So the more I think about this,
the more I'm sure that we won't do anybody a favor in the mid/long term
here by "faking it" in the backend.

I'd need to look into RRD, but even if there wasn't a way there to
submit null-ish values (see the P.S. below), I'd rather see that as a
further argument for switching out RRD for the Rust-based proxmox-rrd
crate, where we have control over these things, than keep recording
measurements that did not happen. That does not mean that doing this
correctly in proxmox-rrd will be trivial once we have migrated (which
is non-trivial on its own).

There are also some ideas about switching to a rather different way to
encode metrics, using a more flexible format and stuff like delta
encoding, i.e. closer to how modern time-series DBs like InfluxDB do
it; Lukas signaled some interest in this work here. But that is
vaporware as of now, so there is no need to wait on it; I just wanted
to mention it so those ideas don't stay too isolated.

But taking a step back, why is QMP even timing out here? Is this not
just reading some in-memory counters that QEMU has ready to go?
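
P.S. regarding null-ish values: if I remember the rrdtool update
semantics correctly, an unknown data point can be encoded as 'U' in the
update string, which makes the graph show a gap instead of a bogus
value. Whether our rrdcached path handles that end to end would still
need checking; a purely hypothetical sketch of the submitting side:

    # Hypothetical sketch: encode a missed reading as rrdtool's
    # 'U' (unknown) instead of a fake 0 when building the
    # colon-separated update string.
    my $stats = { diskread => 123456, diskwrite => undef }; # write timed out
    my $diskread = $stats->{diskread} // 'U';
    my $diskwrite = $stats->{diskwrite} // 'U';
    my $update = join(':', time(), $diskread, $diskwrite);
    # yields e.g. "1700000000:123456:U", so only diskwrite stays unknown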