Hi,

On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage
> > of writes by checkpointer is that they are sorted and can often be
> > combined into larger writes. That's often a lot more efficient: For
> > network attached storage it saves you iops, for local SSDs it's much
> > friendlier to wear leveling.
>
> Thank you for the explanation. I think bgwriter can also merge IO: it
> writes asynchronously to the file system cache, and the OS schedules the
> actual writeback.

Because bgwriter writes are just ordered by their buffer id (and made even
less sequential by only writing out not-recently-used buffers), they are
often effectively random. The OS can't do much about that.
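To illustrate why the sorting matters, here's a toy sketch in plain C (not
the actual bufmgr.c logic): once the dirty blocks are sorted by (relation,
block number), runs of adjacent blocks collapse into one larger write,
whereas in buffer-id order the same blocks stay scattered single-block IOs.

/*
 * Toy sketch, not PostgreSQL source: sort dirty blocks by (relation,
 * block) and merge runs of adjacent blocks into one larger write.
 */
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192

typedef struct DirtyBlock
{
	int		rel;			/* which relation the block belongs to */
	int		block;			/* block number within the relation */
} DirtyBlock;

static int
cmp_dirtyblock(const void *a, const void *b)
{
	const DirtyBlock *da = a;
	const DirtyBlock *db = b;

	if (da->rel != db->rel)
		return da->rel < db->rel ? -1 : 1;
	if (da->block != db->block)
		return da->block < db->block ? -1 : 1;
	return 0;
}

int
main(void)
{
	/* dirty blocks in the scattered order bgwriter would see them */
	DirtyBlock	blocks[] = {
		{1, 7}, {2, 3}, {1, 5}, {1, 6}, {2, 4}, {1, 9}
	};
	int			nblocks = sizeof(blocks) / sizeof(blocks[0]);

	qsort(blocks, nblocks, sizeof(DirtyBlock), cmp_dirtyblock);

	/* walk the sorted array, emitting one write per run of neighbors */
	for (int i = 0; i < nblocks;)
	{
		int			j = i + 1;

		while (j < nblocks &&
			   blocks[j].rel == blocks[i].rel &&
			   blocks[j].block == blocks[j - 1].block + 1)
			j++;

		printf("rel %d: blocks %d..%d as one %d byte write\n",
			   blocks[i].rel, blocks[i].block, blocks[j - 1].block,
			   (j - i) * BLCKSZ);
		i = j;
	}
	return 0;
}

Six scattered dirty blocks become three writes (one of them 24kB) instead
of six random 8kB ones.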
> > Another aspect is that checkpointer's writes are much easier to pace
> > over time than e.g. bgwriter's, because bgwriter is triggered by a
> > fairly short-term signal. Eventually we'll want to combine writes by
> > bgwriter too, but that's always going to be more expensive than doing
> > it in a large batched fashion like checkpointer does.
> >
> > I think we could improve checkpointer's pacing further, fwiw, by taking
> > into account that the WAL volume at the start of a spread-out checkpoint
> > typically is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I run a stress test,
> bgwriter does not write out dirty pages when the data set is smaller than
> shared_buffers.

It *SHOULD NOT* do anything in that situation. There's absolutely nothing
to be gained by bgwriter writing in that case.

> Before the checkpoint, the tps in the stress test was stable and at its
> highest for the whole run. Other databases flush dirty pages at a fixed
> frequency, at intervals, and at dirty-page watermarks; they have a much
> smaller impact on performance when checkpoints occur.

I doubt that slowdown is caused by bgwriter not being active enough. I
suspect what you're seeing is one or more of:

a) The overhead of doing full page writes (due to increasing the WAL
   volume). You could verify whether that's the case by turning
   full_page_writes off (but note that that's not generally safe!), or see
   whether the overhead shrinks if you set wal_compression=zstd or
   wal_compression=lz4 (don't use pglz, it's too slow).

b) The overhead of renaming WAL segments during recycling. You could see
   whether this is related by specifying --wal-segsize 512 or such during
   initdb.

Greetings,

Andres
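PS: Concretely, as a sketch of how I'd test the two hypotheses above (the
data directory path is a placeholder, and again: full_page_writes=off is
for measurement only, it's not generally safe):

# (a) in postgresql.conf, then reload/restart:
wal_compression = zstd		# or lz4; avoid pglz
#full_page_writes = off	# measurement only!

# (b) re-initdb a scratch cluster with larger WAL segments:
initdb --wal-segsize=512 -D /path/to/scratch/cluster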