Hi,
On 2026-02-15 22:17:05 +0100, Tomas Vondra wrote:
> I don't have access to an M1 machine (and it also does not say what type
> of storage it is using, which seems pretty important for a patch aiming
> to improve I/O behavior). But I tried running this on my ryzen machine
> with local SSDs (in RAID0), and with the 100k rows (and fixed handling
> of page cache) I get this:
>
> column_name   io_method   evict   n  master_ms  off_ms  on_ms  effect_pct
> periodic      worker      off    10       35.8    35.1   36.5         2.0
> periodic      worker      os     10       49.4    49.9   58.8         8.1
> periodic      worker      pg     10       39.5    39.9   47.1         8.3
> random        worker      off    10       35.9    35.6   35.7         0.2
> random        worker      os     10       49.0    49.0   42.6        -7.0
> random        worker      pg     10       39.6    39.9   40.9         1.2
> sequential    worker      off    10       28.2    27.9   27.7        -0.4
> sequential    worker      os     10       39.3    39.2   34.8        -6.0
> sequential    worker      pg     10       30.1    30.1   29.4        -1.3
>
> column_name   io_method   evict   n  master_ms  off_ms  on_ms  effect_pct
> periodic      io_uring    off    10       35.9    35.8   35.8        -0.1
> periodic      io_uring    os     10       49.3    49.9   50.0         0.1
> periodic      io_uring    pg     10       40.1    39.8   41.7         2.4
> random        io_uring    off    10       35.6    35.2   35.7         0.8
> random        io_uring    os     10       49.1    48.9   46.1        -3.0
> random        io_uring    pg     10       39.8    40.1   42.6         3.1
> sequential    io_uring    off    10       28.0    27.8   28.0         0.4
> sequential    io_uring    os     10       39.8    39.1   40.7         1.9
> sequential    io_uring    pg     10       30.2    30.0   29.6        -0.8
>
> This is on default config with io_workers=12 and data_checksums=off. I'm
> not showing results for parallel query, because it's irrelevant.
>
> This also has timings for master, for worker and io_uring (which you
> could not get on the M1, at least not on macOS). For "worker" the differences
> are much smaller (within 10% in the worst case), and almost non-existent
> for io_uring. Which suggests this is likely due to the "signal" overhead
> associated with worker, which can be annoying for certain data patterns
> (where we end up issuing an I/O for individual blocks at distance 1).
I don't think this is just the signalling overhead itself. For "periodic" I
think the signalling issue is triggered by the read stream distance being
kept too low: due to the small distance, the latency affects us much more.
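The distance is, roughly, how many reads the stream is allowed to keep in
flight, so at a distance of ~1 every miss has to wait out a full
submit-to-completion round trip to an io worker, with nothing else in
flight to hide that latency behind.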
On my system, with turbo boost etc. disabled:
worker w/ enable_indexscan_prefetch=0:
Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..84.129 rows=100000.00 loops=1)
  Index Searches: 1
  Buffers: shared hit=97150 read=3125
  I/O Timings: shared read=31.274
Planning:
  Buffers: shared hit=97 read=7
  I/O Timings: shared read=0.595
Planning Time: 0.944 ms
Execution Time: 89.319 ms
worker w/ enable_indexscan_prefetch=1:
Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.158..115.279 rows=100000.00 loops=1)
  Index Searches: 1
  Prefetch: distance=1.060 count=99635 stalls=3004 skipped=0 resets=0 pauses=0 ungets=0 forwarded=0
    histogram [1,2) => 93627, [2,4) => 6008
  Buffers: shared hit=97150 read=3125
  I/O Timings: shared read=56.077
Planning:
  Buffers: shared hit=97 read=7
  I/O Timings: shared read=0.612
Planning Time: 0.994 ms
Execution Time: 120.575 ms
Right, a regression. But note how low the distance is - no wonder the worker
latency has a bad effect - we only have the downside, never the upside, as
there's pretty much no IO concurrency.
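As a back-of-the-envelope check: both plans do 3125 shared reads, but the
shared read time goes from ~31.3ms to ~56.1ms, i.e. roughly 8µs of extra
wait per read. With a distance of ~1 that penalty is paid in full for every
single read, which is about what you'd expect if each one eats a worker
wakeup/completion round trip.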
After applying this diff:
@@ -1006,7 +1038,9 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->oldest_io_index = 0;
 		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
+		distance = stream->distance * 2
+			+ 1
+			;
 		distance = Min(distance, stream->max_pinned_buffers);
 		stream->distance = distance;
worker w/ enable_indexscan_prefetch=1 + patch:
Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..82.673 rows=100000.00 loops=1)
  Index Searches: 1
  Prefetch: distance=70.892 count=103109 stalls=5 skipped=0 resets=0 pauses=0 ungets=3474 forwarded=0
    histogram [1,2) => 88975, [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 28, [32,64) => 64, [64,128) => 104, [128,256) => 136, [256,512) => 602, [512,1024) => 13158
  Buffers: shared hit=97150 read=3125
  I/O Timings: shared read=19.711
Planning:
  Buffers: shared hit=97 read=7
  I/O Timings: shared read=0.596
Planning Time: 0.951 ms
Execution Time: 87.887 ms
By no means a huge win compared to prefetching being disabled, but the
regression does vanish.
The problem this fixes is that the periodic workload frequently has cache
hits, each of which reduces stream->distance by 1. Then, on a miss, we
double the distance. But at a distance of 1 doubling also only adds 1, so
with the trivial pattern of alternating one hit and one miss, which this
workload very often produces, you *never* get above a distance of ~1 - we
increase the distance exactly as quickly as we decrease it.
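To make that concrete, here's a toy simulation of just the distance
arithmetic (a sketch, not the actual read_stream.c code; MAX_PINNED merely
stands in for stream->max_pinned_buffers): a hit decrements the distance
with a floor of 1, a miss doubles it (plus an optional bump) and clamps it
to the pin limit.

/*
 * Toy model of just the look-ahead distance arithmetic (hypothetical, not
 * the real read_stream.c): a cache hit decrements the distance with a
 * floor of 1, a miss multiplies it by 2 (plus an optional bump) and clamps
 * it to the pin limit.  Simulates a strict hit/miss alternation.
 */
#include <stdio.h>

#define MAX_PINNED 1024		/* stand-in for stream->max_pinned_buffers */

static int
simulate(int bump, int pairs)
{
	int		distance = 1;

	while (pairs-- > 0)
	{
		/* miss: ramp up, clamped to the pin limit */
		distance = distance * 2 + bump;
		if (distance > MAX_PINNED)
			distance = MAX_PINNED;

		/* hit: back off, but never below 1 */
		if (distance > 1)
			distance--;
	}
	return distance;
}

int
main(void)
{
	printf("d = d * 2     after 20 hit/miss pairs: %d\n", simulate(0, 20));
	printf("d = d * 2 + 1 after 20 hit/miss pairs: %d\n", simulate(1, 20));
	return 0;
}

With plain doubling the strict hit/miss alternation ends right back at 1
after every pair; with the +1 each pair nets a doubling, so the distance
ramps up to the pin limit - which matches the distance histogram after the
patch.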
Greetings,
Andres Freund