On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <[email protected]>
wrote:

> [..]
> > > Of course most of the I/O calls today are hitting page cache, so one
> > > would expect they'll be < 128us most of the time
> >
> > Have you measured whether overhead is measurable when hitting the page
> > cache? I'd hope that it doesn't, due to io combining amortizing the cost
> > somewhat. But it seems worth measuring.
>
> Not yet, I first wanted to hear if I'm not sailing into some plain stupid
> direction somewhere with this idea or implementation (e.g.
> that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).
>
> I'll try to perform this test overhead measurement hopefully with v3 once
> we settle on how to do that bit shifting/clz().
>

[..]
Here's the answer: on a properly isolated perf test run (my old, legacy and
predictable 4s32c64t NUMA box; s_b=8GB, DB size 16GB, hugepages, no
turboboost, proper warmup, no THP, cpupower D0, no physical I/O, ~22k
pread64() calls/sec combined, all hitting the VFS cache):
    pinned to a single NUMA node: numactl --membind=0 --cpunodebind=0
    measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
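
For context, the per-read cost in question is just the usual instr_time
pair plus a single array increment. Roughly something like this (a sketch
only, not the exact patch code; io_histogram and get_bucket_index() are
placeholder names, the latter being the open question from above):

    instr_time  io_start, io_time;

    INSTR_TIME_SET_CURRENT(io_start);
    /* ... the actual read of the block happens here ... */
    INSTR_TIME_SET_CURRENT(io_time);
    INSTR_TIME_SUBTRACT(io_time, io_start);

    /* one extra increment per I/O on top of what track_io_timing already does */
    io_histogram[get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time))]++;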

master + track_io_timing=on, 60s warmup and then 3 runs:
    tps = 44615.603668
    tps = 44556.191492
    tps = 44813.793981
    avg = 44662

master + track_io_timing=on + patch, 60s warmup and then 3 runs:
    tps = 44441.879384
    tps = 44403.101737
    tps = 45036.747418
    avg = 44627

So that's 44627/44662 ≈ 99.92% of master (i.e. essentially no overhead), and
it yields a picture like this:

postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
    bucket_count > 0 and
    backend_type = 'client backend' and
    io_type = 'read' and
    context = 'normal'
order by bucket_latency_us;
 bucket_latency_us | bucket_count
-------------------+--------------
 [0,9)             |       273455
 [8,17)            |      3820379
 [16,33)           |        29359
 [32,65)           |          585
 [64,129)          |          467
 [128,257)         |          419
 [256,513)         |        15828
 [512,1025)        |           89

One can also see the 0.25..0.5ms bucket being larger there (the initial
reads of the data from the physical device); that's a hardware RAID-1 with
2x Intel D3-S4510. And if I do pg_buffercache_evict_all() + vm drop_caches +
pg_stat_reset_shared(), I get this picture (note that it's for the bulkread
context):

postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
    bucket_count > 0 and
    backend_type = 'client backend' and context='bulkread';
 bucket_latency_us | bucket_count
-------------------+--------------
 [0,9)             |       102555
 [8,17)            |           70
 [16,33)           |         3938
 [32,65)           |         6763
 [64,129)          |         5206
 [128,257)         |         9392
 [256,513)         |        10959
 [512,1025)        |        21372
 [1024,2049)       |          502
 [2048,4097)       |           34
 [4096,8193)       |            2
 [8192,16385)      |            2
 [16384,32769)     |            7

So clearly there's a distinction between reading the VFS cache and hitting
physical I/O.

Now we probably just need to settle on the proper get_bucket_index()
implementation.
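
Just to have something concrete to discuss, here's a minimal sketch of the
clz-based variant I have in mind, assuming the first boundary at 8us (as in
the output above) and pg_leftmost_one_pos64() from port/pg_bitutils.h;
IO_HISTOGRAM_NBUCKETS is a placeholder name, not necessarily what the patch
will end up using:

    #include "port/pg_bitutils.h"

    #define IO_HISTOGRAM_NBUCKETS 16    /* placeholder value */

    /*
     * Map an I/O latency in microseconds to a power-of-two bucket:
     * [0,8), [8,16), [16,32), ... with everything beyond the last
     * boundary collapsed into the final bucket.
     */
    static inline int
    get_bucket_index(uint64 latency_us)
    {
        int     idx;

        if (latency_us < 8)
            return 0;

        /* pg_leftmost_one_pos64() == floor(log2(latency_us)) */
        idx = pg_leftmost_one_pos64(latency_us) - 2;

        return Min(idx, IO_HISTOGRAM_NBUCKETS - 1);
    }

That's one clz plus a clamp per I/O, no loop and no division.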

-J.
