Hi,

On 2026-02-18 19:37:16 +0200, Ants Aasma wrote:
> I was also able to convince clang and gcc to vectorize these loops. I
> had to split the innermost loop so the time calculation with the /1000
> for microseconds conversion and the conditional histogram loop are
> done separately, mark all the loop with nounroll pragmas, and tag the
> innermost loop for vectorization by clang. But looking at the
> benchmark results below, probably not worth the effort.

I don't think the CPU efficiency of flushing stats should ever matter to the
degree that vectorizing should be a goal. If it does, we are either keeping
way too many stats or are flushing way way too often.

What matters is to reduce the overhead when doing process local accounting, as
that will typically happen many orders of magnitude more frequently than
flushing / merging stats.



> For the I/O collection, I tried using prewarm, but got really noisy
> results from it. So instead I created a table with 100k rows with one
> row per page, vacuumed it and benchmarked select count(*) over it.
> Interestingly, setting effective_io_concurrency = 1 made the results
> both more consistent and faster.

It's an aside, but anyway: There's really not a whole lot of benefit to doing
AIO when the data is in the page cache. It can only accelerate things if
either checksum computation or memory bandwidth is a limiting factor, as with
worker mode both can be parallelized.  I don't think checksum computation
commonly is a bottleneck with proper compiler optimization.  And while memory
bandwidth can be a major bottleneck on Intel server architectures, I haven't
seen that on AMD.

I'd probably, just out of paranoia, also test without checksums enabled (to
avoid the memory bandwidth hit) and see if the overhead increases if you
change the query to not need to evaluate expressions (e.g. by using SELECT *
FROM tbl OFFSET large_number, or using pg_prewarm with
maintenance_io_concurrency=1).


One thing to be aware of is that with the rdtsc[p] patch (to substantially
reduce timing overhead), it'll become a tad more expensive to convert an
instr_time to nanoseconds, due to having to convert cycles to nanoseconds
first. It may be worth testing the combination.
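For a sense of what that conversion costs: cycle counts are typically turned into nanoseconds with a precomputed fixed-point multiply and shift rather than a division per call. This is just a sketch of that general technique; the names, the 32-bit shift, and the calibration interface are illustrative assumptions, not the actual patch's code:

```c
#include <stdint.h>

/* fixed-point precision for the cycles->ns conversion (assumed value) */
#define CYC_SHIFT 32

/*
 * Precompute the multiplier for a given TSC frequency:
 *   mult = (1e9 << CYC_SHIFT) / tsc_hz
 * so that ns = (cycles * mult) >> CYC_SHIFT.
 */
static inline uint64_t
cycles_mult_for_hz(uint64_t tsc_hz)
{
	return (uint64_t) (((__uint128_t) 1000000000 << CYC_SHIFT) / tsc_hz);
}

/* per-conversion cost: one 64x64->128 multiply and a shift */
static inline uint64_t
cycles_to_ns(uint64_t cycles, uint64_t mult)
{
	return (uint64_t) (((__uint128_t) cycles * mult) >> CYC_SHIFT);
}
```

So each instr_time-to-nanoseconds conversion adds a widening multiply and a shift on top of reading the raw counter, which is cheap but not free when done per IO.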

On that note, why is this measuring things in nanoseconds, given that we
already convert instr_time to microseconds nearby, that it's quite unlikely
you'd ever see IO times below a microsecond, and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?


Just rediscovered that the per-backend tracking patch added an external
function call to pgstat_count_io_op_time() / pgstat_count_backend_io_op(), and
that a fair number of more recently added branches test values that are
constants at the call site :(. Probably doesn't matter, but makes me sad
nonetheless :)




> I still want to look at the memory overhead more closely. The 30kB per
> backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, we e.g. have been talking about
tracking IO that bypasses shared buffers separately.  And a few more context
types (e.g. index inner/leaf) could also make sense.

Without the histogram change those additions would be a somewhat moderate
increase in memory usage, but with this change they'd increase memory usage a
lot more.


> but I think having it in PgStat_BktypeIO is not great. This makes
> PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome.

I guess we could count IO as 4 byte integers, and shift all bucket counts down
in the rare case of an overflow. It's just a 2x improvement, but ...
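To make the shift-down idea concrete, here's a minimal sketch. All names (IoHist, NBUCKETS, the saturation point) are placeholders, not actual pgstat structures; the point is just that halving every bucket on overflow preserves the histogram's shape while keeping counters at 4 bytes:

```c
#include <stdint.h>

#define NBUCKETS 16

typedef struct IoHist
{
	uint32_t	counts[NBUCKETS];
	uint32_t	shift;			/* how many times counts were halved */
} IoHist;

static void
io_hist_add(IoHist *h, int bucket)
{
	if (h->counts[bucket] == UINT32_MAX)
	{
		/* rare: halve every bucket to make room, remember the scaling */
		for (int i = 0; i < NBUCKETS; i++)
			h->counts[i] >>= 1;
		h->shift++;
	}
	h->counts[bucket]++;
}
```

A reader of the stats would multiply counts by 2^shift to recover approximate totals; relative bucket proportions stay intact, which is what matters for a latency histogram.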

I think we might need to reduce the number of buckets somewhat.


Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?


Greetings,

Andres Freund

