ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time

Load distributed checkpoints will do 2. in a much better way than the bgwriter_all_* GUC options. I think we should remove that aspect of bgwriter in favor of this patch.

The scheduling of bgwriter gets quite complicated with the patch. If I'm reading it correctly, bgwriter will keep periodically writing buffers to achieve 1. while the "write"-phase of checkpoint is in progress. That makes sense; now that checkpoints take longer, we would miss goal 1. otherwise. But we don't do that in the "sleep-between-write-and-fsync"- and "fsync"-phases. We should, shouldn't we?

I'd suggest rearranging the code so that BgBufferSync and mdsync would basically stay as they are without the patch; their signatures wouldn't change. To do the naps during a checkpoint, inject calls to new functions like CheckpointWriteNap() and CheckpointFsyncNap() inside BgBufferSync and mdsync. Those nap functions would check whether enough progress has been made since the last call, and sleep if so.

The piece of code that implements 1. would be refactored to a new function, let's say BgWriteLRUBuffers(). The nap-functions would call BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed since last call to it.

This way the changes to CreateCheckpoint, BgBufferSync and mdsync would be minimal, and bgwriter would keep cleaning buffers for normal backends during the whole checkpoint.
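
To make the suggestion a bit more concrete, here is a rough sketch of how such a nap function could behave. This is not code from the patch. It runs against a simulated clock, BgWriteLRUBuffers() only counts its calls, and the parameters of CheckpointWriteNap() (progress, start_ms, duration_ms), the nap() helper, and simulate_checkpoint_writes() are all made up for illustration. It also assumes that "enough progress" means the fraction of buffers written is ahead of the fraction of the allotted time that has elapsed.

```c
#include <assert.h>

/* Simulated clock in milliseconds (assumption: real code would read the system time). */
static long now_ms = 0;

/* Stand-in for the bgwriter_delay setting, in milliseconds. */
static const long bgwriter_delay = 200;

static long last_lru_ms = 0;   /* when BgWriteLRUBuffers last ran */
static int  lru_rounds = 0;    /* how many times goal 1 was serviced */
static int  naps_taken = 0;    /* how many times we slept */

/* Hypothetical: the goal-1 part of bgwriter, factored out. Here it only counts calls. */
static void BgWriteLRUBuffers(void)
{
    lru_rounds++;
}

/* Simulated sleep: just advances the clock. */
static void nap(long ms)
{
    now_ms += ms;
    naps_taken++;
}

/*
 * Hypothetical nap hook, called from the checkpoint write loop after each
 * buffer write.  "progress" is the fraction of checkpoint writes done (0..1).
 * While the checkpoint is ahead of schedule, sleep in bgwriter_delay steps,
 * and run the LRU cleaning whenever bgwriter_delay has passed since its last run.
 */
static void CheckpointWriteNap(double progress, long start_ms, long duration_ms)
{
    for (;;)
    {
        if (now_ms - last_lru_ms >= bgwriter_delay)
        {
            BgWriteLRUBuffers();
            last_lru_ms = now_ms;
        }

        double elapsed = (double) (now_ms - start_ms) / (double) duration_ms;

        if (progress <= elapsed)
            break;              /* on or behind schedule: go write more */
        nap(bgwriter_delay);
    }
}

/* Toy driver: "write" nbuffers buffers spread over duration_ms; returns LRU rounds run. */
static int simulate_checkpoint_writes(int nbuffers, long duration_ms)
{
    assert(nbuffers > 0 && duration_ms > 0);

    now_ms = 0;
    last_lru_ms = 0;
    lru_rounds = 0;
    naps_taken = 0;

    long start_ms = now_ms;

    for (int i = 0; i < nbuffers; i++)
    {
        now_ms += 1;            /* pretend each buffer write takes 1 ms */
        CheckpointWriteNap((double) (i + 1) / nbuffers, start_ms, duration_ms);
    }
    return lru_rounds;
}
```

Calling simulate_checkpoint_writes(10, 2000) spreads ten writes over roughly two simulated seconds, and the LRU cleaning keeps running in between. A CheckpointFsyncNap() could presumably follow the same pattern, which is what would keep goal 1 covered in the later phases too.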

Another thought is to have a separate checkpointer process, so that the bgwriter process can keep cleaning dirty buffers while the checkpoint runs in a separate process. One problem with that is that we currently collect all the fsync requests in bgwriter. If we had a separate checkpointer process, we'd need to do that in the checkpointer instead, and bgwriter would need to send a message to the checkpointer every time it flushes a buffer, which would be a lot of chatter. Alternatively, bgwriter could somehow pass the pendingOpsTable to the checkpointer process at the beginning of the checkpoint, but that's not exactly trivial either.

PS. Great that you're working on this. It's a serious problem under heavy load.

  Heikki Linnakangas
