ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.
Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time
Load distributed checkpoints will do 2. in a much better way than the
bgwriter_all_* guc options. I think we should remove that aspect of
bgwriter in favor of this patch.
The scheduling of bgwriter gets quite complicated with the patch. If I'm
reading it correctly, bgwriter will keep periodically writing buffers to
achieve 1. while the "write"-phase of checkpoint is in progress. That
makes sense; now that checkpoints take longer, we would miss goal 1.
otherwise. But we don't do that in the "sleep-between-write-and-fsync"-
and "fsync"-phases. We should, shouldn't we?
I'd suggest rearranging the code so that BgBufferSync and mdsync would
basically stay like they are without the patch; the signature wouldn't
change. To do the naps during a checkpoint, inject calls to new
functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
BgBufferSync and mdsync. Those nap functions would check if enough
progress has been made since last call and sleep if so.
The piece of code that implements 1. would be refactored to a new
function, let's say BgWriteLRUBuffers(). The nap-functions would call
BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
since last call to it.
This way the changes to CreateCheckpoint, BgBufferSync and mdsync would
be minimal, and bgwriter would keep cleaning buffers for normal backends
during the whole checkpoint.
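
To make the nap-function idea concrete, here's a rough standalone sketch in C. It is not taken from the patch: the simulated clock, NAP_STEP_MS, MaybeWriteLRUBuffers() and SimulateCheckpointWrites() are made up for illustration, the signature of CheckpointWriteNap() is only a guess, and BgWriteLRUBuffers() is just a counter here. It only shows the shape of the control flow: nap while the checkpoint is ahead of schedule, and run an LRU cleaning round whenever more than bgwriter_delay milliseconds have passed since the last one.

```c
#include <assert.h>

/* Everything below is a hypothetical stand-in, not PostgreSQL source. */
#define NAP_STEP_MS 50              /* assumed length of one short sleep */

static int  bgwriter_delay = 200;   /* ms, mirrors the bgwriter_delay GUC */
static long now_ms = 0;             /* simulated clock, so this runs standalone */
static long last_lru_ms = 0;        /* when BgWriteLRUBuffers last ran */
static int  lru_rounds = 0;         /* how many LRU cleaning rounds ran */

/* Goal 1: clean buffers for normal backends (real work elided in this sketch). */
static void
BgWriteLRUBuffers(void)
{
    lru_rounds++;
    last_lru_ms = now_ms;
}

/* Run an LRU round if more than bgwriter_delay ms passed since the last one. */
static void
MaybeWriteLRUBuffers(void)
{
    if (now_ms - last_lru_ms > bgwriter_delay)
        BgWriteLRUBuffers();
}

/*
 * Would be called from the checkpoint write loop after each buffer write.
 * Sleeps in short steps while the checkpoint is ahead of schedule and keeps
 * the LRU cleaning going in between. Returns the number of naps taken.
 */
static int
CheckpointWriteNap(int done, int total, long start_ms, long target_ms)
{
    int naps = 0;

    assert(total > 0 && done <= total);
    MaybeWriteLRUBuffers();
    /* Ahead of schedule: elapsed time is less than our share of the target. */
    while (now_ms - start_ms < target_ms * done / total)
    {
        now_ms += NAP_STEP_MS;      /* stands in for a real sleep */
        naps++;
        MaybeWriteLRUBuffers();
    }
    return naps;
}

/* Toy write phase: each buffer write costs 1 ms of simulated time. */
static long
SimulateCheckpointWrites(int nbuffers, long target_ms)
{
    long start_ms = now_ms;

    for (int i = 1; i <= nbuffers; i++)
    {
        now_ms += 1;                /* "write" one dirty buffer */
        CheckpointWriteNap(i, nbuffers, start_ms, target_ms);
    }
    return now_ms - start_ms;
}
```

Calling SimulateCheckpointWrites(10, 1000) stretches ten one-millisecond writes over roughly the one-second target, with several LRU rounds running in between. With a target of 0 it never naps. A CheckpointFsyncNap() could presumably follow the same pattern, measuring progress in files synced instead of buffers written.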
Another thought is to have a separate checkpointer-process so that the
bgwriter process can keep cleaning dirty buffers while the checkpoint is
running in a separate process. One problem with that is that we
currently collect all the fsync requests in bgwriter. If we had a
separate checkpointer process, we'd need to do that in the checkpointer
instead, and bgwriter would need to send a message to the checkpointer
every time it flushes a buffer, which would be a lot of chatter.
Alternatively, bgwriter could somehow pass the pendingOpsTable to the
checkpointer process at the beginning of checkpoint, but that's not
exactly trivial either.
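
A toy illustration of the trade-off, again in C and again purely hypothetical: the PendingOps struct and both functions are invented for this sketch and are far simpler than the real pendingOpsTable. It runs in a single process, so it sidesteps exactly the part that is hard, namely getting the table across a process boundary. The point is only that a deduplicating table handed over once carries far fewer entries than one message per buffer flush.

```c
#include <assert.h>

#define MAX_PENDING 64              /* arbitrary limit for this sketch */

/* Hypothetical stand-in for a table of segments that still need an fsync. */
typedef struct PendingOps
{
    int segs[MAX_PENDING];          /* distinct segment ids */
    int nsegs;
} PendingOps;

/* Remember that a segment needs fsync; returns 1 if it was a new entry. */
static int
RememberFsyncRequest(PendingOps *table, int seg)
{
    for (int i = 0; i < table->nsegs; i++)
    {
        if (table->segs[i] == seg)
            return 0;               /* already pending, nothing to add */
    }
    assert(table->nsegs < MAX_PENDING);
    table->segs[table->nsegs++] = seg;
    return 1;
}

/*
 * Hand the accumulated requests to the checkpointer at checkpoint start and
 * reset the source table, so later requests count toward the next checkpoint.
 */
static void
HandOverPendingOps(PendingOps *from, PendingOps *to)
{
    *to = *from;
    from->nsegs = 0;
}
```

If bgwriter flushes six buffers that fall in three segments, the per-flush approach sends six messages, while the hand-over moves three entries in one step.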
PS. Great that you're working on this. It's a serious problem under