On Tue, Jan 21, 2025 at 11:23:06AM -0500, Andres Freund wrote:
> On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote:
>> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote:
>> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote:
>> >> 3. What is the purpose of the preallocated_segments directory? What I
>> >> have in mind is that we just preallocate under the normal filename so
>> >> that XLogWrite could open it directly. This is the same as what
>> >> wal_recycle does, and we can reuse the same strategy to clean them up
>> >> if they are not needed anymore.
>> > 
>> > The purpose is to limit the use of pre-allocated segments to only
>> > situations where WAL recycling is not sufficient.  Basically, if writing a
>> > record would require a new segment to be created, we can quickly pull a
>> > pre-allocated one instead of creating it ourselves.  Besides simplifying
>> > matters, this prevents a lot of unnecessary pre-allocation, since many
>> > workloads will almost never need anything beyond the recycled segments.
> 
> I don't really understand that argument - we should be able to predict rather
> precisely whether we need to preallocate or not. We have the recent WAL "fill
> rate", we know the end of the WAL and we can easily track how far ahead of the
> current point we have allocated.  Why preallocate when we have a large reserve
> of "future" segments? Why preallocate in a separate directory when we have no
> future segments?

If we can indeed reliably predict whether we need pre-allocation, then
sure, let's just create future segments directly in pg_wal.  I'm not sure
we could reliably predict whether WAL will be recycled in time, so we might
pre-allocate a bit more than necessary, but that's not too terrible.  My
"pooling" approach was intended to keep the pre-allocation to a minimum
(IME you really only need a couple at any given time) and to avoid the
guesswork involved in predicting.
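
To make the prediction idea concrete, here is a rough sketch of the kind of
estimate being discussed: take the recent fill rate, look some distance
ahead, and subtract what is already allocated.  Everything in it is
hypothetical (the function name, the parameters, the fixed look-ahead
horizon); it is not PostgreSQL code, and it deliberately ignores whether
recycled segments will show up in time, which is the part I'm unsure about.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not PostgreSQL code: estimate how many WAL segments
 * would need to be pre-allocated so that the space already available covers
 * the WAL expected to be written during the next horizon_secs seconds.
 *
 * fill_rate_bytes_per_sec     recent WAL generation rate (assumed input)
 * bytes_left_in_current_seg   free space before the current segment ends
 * future_segments             segments already allocated past the current one
 */
uint64_t
segments_to_preallocate(uint64_t fill_rate_bytes_per_sec,
                        uint64_t horizon_secs,
                        uint64_t segment_size,
                        uint64_t bytes_left_in_current_seg,
                        uint64_t future_segments)
{
    uint64_t expected;
    uint64_t needed;

    assert(segment_size > 0);

    expected = fill_rate_bytes_per_sec * horizon_secs;

    /* WAL that still fits in the current segment needs no new file. */
    if (expected <= bytes_left_in_current_seg)
        return 0;
    expected -= bytes_left_in_current_seg;

    /* Round up to whole segments. */
    needed = (expected + segment_size - 1) / segment_size;

    /* A large enough reserve of future segments means nothing to do. */
    if (needed <= future_segments)
        return 0;
    return needed - future_segments;
}
```

With a reserve of future segments at least as large as the estimate, this
returns zero, which matches the "why preallocate when we have a large
reserve" case.  The guesswork is all in choosing the rate and the horizon.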

>> That being said, it would be nice to avoid the fsync() overhead to move a
>> pre-allocated WAL into place.  My first instinct is that would be
>> substantially more complicated and may not actually improve matters all
>> that much, but I agree that it's worth exploring.
> 
> FWIW, I've seen the fsyncs around recycling being a rather substantial
> bottleneck. To the point of the main benefit of larger segments being the
> reduction in number of fsyncs at the end of a checkpoint.  I think we should
> be able to make the fsyncs a lot more efficient by batching them, first rename
> a bunch of files, then fsync them and the directory. The current pattern
> basically requires a separate filesystem journal flush for each WAL segment.

+1, these kinds of fsync() patterns should be fixed.
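
For illustration, the batched ordering described above (rename everything
first, then fsync the files, then fsync the directory once) might look
roughly like the sketch below.  It is a minimal POSIX example under stated
assumptions, not the existing PostgreSQL code path: the function names are
made up, all targets are assumed to live in the single directory "dir", and
it assumes a platform where a directory can be opened read-only and
fsync()'d.  Whether this actually saves journal flushes will depend on the
filesystem.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* fsync a file or directory by path; returns 0 on success, -1 on error. */
int
fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Hypothetical helper: rename n files, then make the whole batch durable.
 * Returns 0 on success, -1 on the first failure (no rollback is attempted).
 */
int
rename_batch_durably(const char *dir,
                     const char *const *from,
                     const char *const *to,
                     int n)
{
    int i;

    assert(n >= 0);

    /* Pass 1: perform all the renames without syncing anything. */
    for (i = 0; i < n; i++)
        if (rename(from[i], to[i]) != 0)
            return -1;

    /* Pass 2: flush each renamed file. */
    for (i = 0; i < n; i++)
        if (fsync_path(to[i]) != 0)
            return -1;

    /* Pass 3: a single directory fsync covers all the new names. */
    return fsync_path(dir);
}
```

The difference from a per-segment pattern is only the ordering: the
directory is synced once per batch instead of once per file.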

-- 
nathan

