On Mon, Jun 30, 2025 at 01:36:12PM +0000, Bertrand Drouvot wrote:
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer
>   or is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to
>   a workload change?
To disclose some information here: this is something that our user base has
found useful to know, and an area where Postgres lags behind other projects
with similar properties.

> * Do we need to serialize the stats based on their names (as for
> PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering
> preserved if file stats format is not changed": I think the answer is yes
> (see f98dbdeb51d), which means there is no need for
> to_serialized_name/from_serialized_name.

The serialization is important for the custom wait events, much less for wait
events related to injection points (where we don't really care), which could
be filtered out when reporting the stats.  It seems to me that associating the
names (perhaps as $CLASS/$NAME on disk as a single string; two strings with
two lengths are perhaps better) is going to be needed.  When the stats are
read, the wait event ID could be allocated from the startup process in the
from_serialized_name callback associated with the new stats kind, before
extensions would try to allocate it.

> * What if a new wait event is added? We'd need to change the stats file
> format, unless: the wait event stats kind becomes a variable one or we change
> a bit the way fixed stats are written/read to/from the stat file (we could
> add a new field in the PgStat_KindInfo for example).

I am pretty sure that everybody is going to forget about this dependency,
myself included.  Please note that we add wait events in stable branches,
sometimes as part of a bug fix, and we cannot bump the version of the stats
file there, or stats would be lost on restart.  So these stats should be
variable, which also ties back to the custom wait events.

For the stats waiting to be flushed, you could allocate a single array of
WAIT_EVENT_CUSTOM_HASH_MAX_SIZE elements in the TopMemoryContext of each
backend, as we know that the number of custom wait events is never going to
be higher than that, indexed by the assigned ID.  It may be simpler to
allocate a series of arrays, one for each class, sized depending on the wait
event enum sizes.  These could be indexed based on the enums generated in
wait_event_types.h, minus their initial value (for the array of events in
WaitEventTimeout, PG_WAIT_TIMEOUT needs to be subtracted).  Deriving the
contents that you need for the sizing and indexing of the pending data from
WaitClassTableEntry does seem necessary to some degree.

Now, your proposal feels overcomplicated, and the names may not be necessary,
as we already have calls able to do class name and event name lookups based
on the uint32 class/event values.  So we should not require a second way to
do a name <=> key lookup.  The to_serialized_name callback could then call
pgstat_get_wait_event() to get the event names, and
pgstat_get_wait_event_type() for the classes when writing the entries.
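To be clearer about what I mean by the per-class arrays, here is a rough
sketch for the Timeout class only (none of these names exist in the tree;
NUM_WAIT_EVENTS_TIMEOUT is a made-up constant that a real patch would derive
from the generated wait_event_types.h, and the includes are approximate):

/*
 * Hypothetical sketch of backend-local pending counters for the Timeout
 * class, allocated lazily in TopMemoryContext and indexed by the offset of
 * the event within its class.
 */
#include "postgres.h"

#include "utils/memutils.h"
#include "utils/wait_event.h"

/* Assumed size of the WaitEventTimeout enum, for illustration only. */
#define NUM_WAIT_EVENTS_TIMEOUT    16

static uint64 *pending_timeout_counts = NULL;

static void
pending_wait_event_increment(uint32 wait_event_info)
{
    uint32      classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    uint32      eventId = wait_event_info & WAIT_EVENT_ID_MASK;

    /* Other classes would get their own arrays, sized from their enums. */
    if (classId != PG_WAIT_TIMEOUT)
        return;

    if (pending_timeout_counts == NULL)
        pending_timeout_counts = (uint64 *)
            MemoryContextAllocZero(TopMemoryContext,
                                   sizeof(uint64) * NUM_WAIT_EVENTS_TIMEOUT);

    /*
     * The enum values generated in wait_event_types.h are the class base
     * plus an offset, so masking with WAIT_EVENT_ID_MASK (equivalently,
     * subtracting PG_WAIT_TIMEOUT) gives a dense array index.
     */
    Assert(eventId < NUM_WAIT_EVENTS_TIMEOUT);
    pending_timeout_counts[eventId]++;
}

The same pattern would apply to the other built-in classes, plus one array of
WAIT_EVENT_CUSTOM_HASH_MAX_SIZE slots for the custom class.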
> Note: for some backends the wait events stats are not flushed (walwriter for
> example), so we need to find additional places to flush the wait events stats.

This is not new.  This limitation exists for some auxiliary processes or for
processes within their main loops.  As long as we have APIs that can be called
to do the flushes, IMO we are fine, like we do for WAL, IO, etc.

> It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats
> kind.  That way:
>
> * It would be easier to add custom wait events if we want to
> * It would be possible to add new wait events without having to change the
>   stats file format
>
> So adding 0004 to see what it would look like to have a variable sized stats
> kind instead and decide how we want to move forward.

IMO, we should just do that.  ABI compatibility and custom wait events make
fixed-sized stats unfit for what you are trying to achieve here.

> That said it might be better to use all the 64 bits (means not have the half
> full of zeroes) for the hash key (better hashing distribution?): we could
> imagine using something like:
>
> ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)
>
> for the hash key.

Using a uint32 for the hash key is fine; at least it's not an issue for the
pgstats machinery.  So I would recommend sticking to WAIT_EVENT_CLASS_MASK and
WAIT_EVENT_ID_MASK for the computation of the key of the entry inserted in
shmem.  Then, when writing the stats, we can derive the class and event names
from the key alone.
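In other words, something like this rough sketch (the two function names are
made up for illustration; only the masks and the pgstat_get_wait_event*()
routines exist today):

static inline uint32
wait_event_stats_key(uint32 wait_event_info)
{
    /* Keep only the class and event ID bits; the key stays a uint32. */
    return wait_event_info & (WAIT_EVENT_CLASS_MASK | WAIT_EVENT_ID_MASK);
}

/*
 * When writing the stats file, the class and event names can be recovered
 * from the key alone, without storing them in the entries.
 */
static void
wait_event_stats_entry_names(uint32 key, const char **class_name,
                             const char **event_name)
{
    *class_name = pgstat_get_wait_event_type(key);
    *event_name = pgstat_get_wait_event(key);
}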
--
Michael