On Tue, Jan 21, 2025 at 11:23:06AM -0500, Andres Freund wrote:
> On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote:
>> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote:
>> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote:
>> >> 3. What is the purpose of the preallocated_segments directory? What I
>> >> have in mind is that we just preallocate the normal filename so that
>> >> XLogWrite could open it directly. This is the same as what wal_recycle
>> >> does, and we can reuse the same strategy to clean them up if they are
>> >> not needed anymore.
>> >
>> > The purpose is to limit the use of pre-allocated segments to only
>> > situations where WAL recycling is not sufficient. Basically, if writing a
>> > record would require a new segment to be created, we can quickly pull a
>> > pre-allocated one instead of creating it ourselves. Besides simplifying
>> > matters, this prevents a lot of unnecessary pre-allocation, since many
>> > workloads will almost never need anything beyond the recycled segments.
>
> I don't really understand that argument - we should be able to predict rather
> precisely whether we need to preallocate or not. We have the recent WAL "fill
> rate", we know the end of the WAL and we can easily track how far ahead of the
> current point we have allocated. Why preallocate when we have a large reserve
> of "future" segments? Why preallocate in a separate directory when we have no
> future segments?

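For concreteness, the kind of prediction described above could look roughly
like the sketch below. It is only an illustration: the function name, the
fixed look-ahead window, and measuring the fill rate in segments per second
are all assumptions, not anything that exists in the tree.

```python
import math


def segments_to_preallocate(fill_rate_segs_per_sec, lookahead_secs, future_segments):
    """Return how many additional segments to create ahead of the WAL end.

    All three inputs are hypothetical quantities for this sketch:
      fill_rate_segs_per_sec: recent WAL "fill rate", in segments per second
      lookahead_secs:         how far into the future we want to be covered
      future_segments:        segments already allocated past the current point
    """
    if fill_rate_segs_per_sec < 0 or lookahead_secs < 0 or future_segments < 0:
        raise ValueError("inputs must be non-negative")

    # Segments we expect to consume within the look-ahead window.
    expected = math.ceil(fill_rate_segs_per_sec * lookahead_secs)

    # Only allocate the shortfall; a large reserve means nothing to do.
    return max(0, expected - future_segments)
```

With a large reserve of future segments this returns 0, which is the "why
preallocate at all" case; with no reserve it returns the whole expected
consumption for the window.
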
If we can indeed reliably predict whether we need pre-allocation, then
sure, let's just create future segments directly in pg_wal.  I'm not sure
we could reliably predict whether WAL will be recycled in time, so we might
pre-allocate a bit more than necessary, but that's not too terrible.  My
"pooling" approach was intended to keep the pre-allocation to a minimum
(IME you really only need a couple at any given time) and to avoid the
guesswork involved in predicting.

>> That being said, it would be nice to avoid the fsync() overhead to move a
>> pre-allocated WAL into place. My first instinct is that would be
>> substantially more complicated and may not actually improve matters all
>> that much, but I agree that it's worth exploring.
>
> FWIW, I've seen the fsyncs around recycling being a rather substantial
> bottleneck. To the point of the main benefit of larger segments being the
> reduction in number of fsyncs at the end of a checkpoint. I think we should
> be able to make the fsyncs a lot more efficient by batching them, first rename
> a bunch of files, then fsync them and the directory. The current pattern
> basically requires a separate filesystem journal flush for each WAL segment.

+1, these kinds of fsync() patterns should be fixed.

-- 
nathan
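
P.S. To make the batching idea concrete, here is a rough sketch of "rename a
bunch of files, then fsync them and the directory". It is written in Python
against plain POSIX calls purely for illustration; the function name is made
up, it assumes every source and destination lives in the same directory, and
it is not how the server code is structured.

```python
import os


def install_segments_batched(renames, wal_dir):
    """Rename many segment files, then flush them together.

    renames: list of (src_path, dst_path) pairs, all assumed to be inside
             wal_dir (an assumption of this sketch).
    Returns the number of files installed.
    """
    # Phase 1: perform all renames without flushing anything in between.
    for src, dst in renames:
        os.rename(src, dst)

    # Phase 2: fsync each renamed file.
    for _, dst in renames:
        fd = os.open(dst, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Phase 3: a single fsync of the directory persists all of the renames,
    # instead of one directory flush per segment.
    dir_fd = os.open(wal_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    return len(renames)
```

The point is only the ordering: the directory is flushed once per batch
rather than once per file. If the sources lived in a different directory,
that directory would need its own fsync as well.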