Hi,

On 2026-02-18 19:37:16 +0200, Ants Aasma wrote:
> I was also able to convince clang and gcc to vectorize these loops. I
> had to split the innermost loop so the time calculation with the /1000
> for microseconds conversion and the conditional histogram loop are
> done separately, mark all the loop with nounroll pragmas, and tag the
> innermost loop for vectorization by clang. But looking at the
> benchmark results below, probably not worth the effort.
I don't think the CPU efficiency of flushing stats should ever matter to
the degree that vectorizing should be a goal. If it does, we are either
keeping way too many stats or are flushing way way too often. What matters
is reducing the overhead of the process-local accounting, as that will
typically happen many orders of magnitude more frequently than flushing /
merging stats.

> For the I/O collection, I tried using prewarm, but got really noisy
> results from it. So instead I created a table with 100k rows with one
> row per page, vacuumed it and benchmarked select count(*) over it.
> Interestingly, setting effective_io_concurrency = 1 made the results
> both more consistent and faster.

It's an aside, but anyway: There's really not a whole lot of benefit to
doing AIO when the data is in the page cache. It can only accelerate
things if either checksum computation or memory bandwidth is a limiting
factor, as with worker mode both can be parallelized. I don't think the
checksum computation commonly is a bottleneck with proper compiler
optimization. And while memory bandwidth can be a major bottleneck on
Intel server architectures, I haven't seen that on AMD.

I'd probably, just out of paranoia, also test without checksums enabled
(to avoid the memory bandwidth hit) and see if the overhead increases if
you change the query to not need to evaluate expressions (e.g. by using
SELECT * FROM tbl OFFSET large_number, or using pg_prewarm with
maintenance_io_concurrency=1).

One thing to be aware of is that with the rdtsc[p] patch (to
substantially reduce timing overhead), it'll become a tad more expensive
to convert an instr_time to nanoseconds (due to having to convert cycles
to nanoseconds). It may be worth testing the combination.
On that note, why is this measuring things in nanoseconds, given that we
already convert instr_time to microseconds nearby, that it's quite
unlikely you'd ever have IO times below a microsecond, and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?

Just rediscovered that the per-backend tracking patch added external
function calls to pgstat_count_io_op_time() and
pgstat_count_backend_io_op(), and that a fair number of more recently
added branches are constants at the callsite :(. Probably doesn't matter,
but makes me sad nonetheless :)

> I still want to look at the memory overhead more closely. The 30kB per
> backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, e.g. we have been talking about
tracking IO that bypasses shared buffers separately. And a few more
context types (e.g. index inner/leaf) could also make sense. Without this
change that'd be a somewhat moderate increase in memory usage, but with
this change it'd increase a lot more.

> but I think having it in PgStat_BktypeIO is not great. This makes
> PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome. I guess we could count IO as 4 byte integers,
and shift all bucket counts down in the rare case of an overflow. It's
just a 2x improvement, but ...

I think we might need to reduce the number of buckets somewhat.

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third
for 16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

Greetings,

Andres Freund
