On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will
overwhelm the I/O subsystem, and slow down queries. That's what we saw
before spread checkpoints: when a checkpoint happened, query response
times jumped up.
That isn't quite right. Previously we had lock issues as well, and
checkpoints would take considerable time to complete. What I am talking
about is that the background writer (and wal writer where applicable)
has already done all the work before a checkpoint is even called. Consider
that every one of the clients I am active with sets
checkpoint_completion_target to 0.9. With a proper bgwriter config this
works.
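For concreteness, a postgresql.conf sketch along those lines; the numbers
are only illustrative and have to be tuned per workload, but the shape is
an aggressive bgwriter plus a spread checkpoint:

  checkpoint_completion_target = 0.9
  checkpoint_timeout = 15min       # illustrative
  checkpoint_segments = 64         # illustrative
  bgwriter_delay = 50ms            # wake the bgwriter more often (default 200ms)
  bgwriter_lru_maxpages = 500      # allow more writes per round (default 100)
  bgwriter_lru_multiplier = 3.0    # write ahead of projected demand (default 2.0)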
4. Bgwriter. We should be adjusting the bgwriter so that it writes
everything out in a manner that allows any checkpoint to pass essentially
unnoticed.
Oh, I see where you're going.
O.k. good. I am not nuts :D
Yeah, that would be one way to do it. However, spread checkpoints have
pretty much the same effect. Imagine that you tune your system like
this: disable bgwriter altogether, and set
checkpoint_completion_target=0.9. With that, there will be a
checkpoint in progress most of the time, because by the time one
checkpoint completes, it's almost time to begin the next one already.
In that case, the checkpointer will be slowly performing the writes,
all the time, in the background, without affecting queries. The effect
is the same as what you described above, except that it's the
checkpointer doing the writing, not bgwriter.
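In postgresql.conf terms that tuning would look roughly like this (values
illustrative; bgwriter_lru_maxpages = 0 is what effectively turns the
bgwriter off):

  bgwriter_lru_maxpages = 0           # effectively disables the background writer
  checkpoint_completion_target = 0.9  # spread each checkpoint over 90% of the interval
  checkpoint_timeout = 5min           # illustrative
  checkpoint_segments = 32            # illustrative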
O.k. if that is true, then we have redundant systems and we need to
remove one of them.
Yeah, wal_keep_segments is a hack. We should replace it with something
else, like having a registry of standbys in the master, and how far
they've streamed. That way the master could keep around the amount of
WAL actually needed by them, no more and no less. But that's a different
story.
Other oddities:
Yes, checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.
The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used. If you
don't care about that, I guess we could allow max_wal_size=-1 to mean
infinite, and checkpoints would be driven purely by time, not WAL
consumption.
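To put rough numbers on the soft limit we have today (values illustrative,
and the max_wal_size = -1 bit is of course only the proposal above, not
anything that exists yet):

  checkpoint_segments = 32
  checkpoint_completion_target = 0.9
  # pg_xlog normally stays below about
  #   (2 + checkpoint_completion_target) * checkpoint_segments + 1
  #   = (2 + 0.9) * 32 + 1  ~= 94 segments  ~= 94 * 16MB  ~= 1.5GB
  # A max_wal_size of -1 would drop that bound entirely and leave checkpoints
  # to be triggered by time alone.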
I would not only agree with that, I would argue that max_wal_size doesn't
need to be there, at least not as a default. Perhaps it could be an
"advanced" configuration option that only those in the know see.
Basically, we start with some amount X, perhaps set at initdb time. That
amount changes dynamically based on the amount of data being written. In
order not to suffer recycling and creation penalties, we always keep X+N
segments, where N is enough to keep up with new data.
To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently, both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed to do
for the preallocation/recycling: it would be tuned automatically. But
you still need something like max_wal_size for the other thing, to
trigger a checkpoint if too much WAL is being consumed.
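To illustrate the split with made-up numbers (nothing below exists today;
max_wal_size is just the proposed GUC):

  # Suppose recent checkpoint cycles consumed 40, 55 and 47 segments of WAL.
  # The preallocation/recycling target would track that history and keep
  # roughly the recent peak, ~55 segments (~880MB), of recycled segments
  # around, with no GUC to set. The proposed setting below stays a pure
  # backstop: it only forces an earlier checkpoint if a burst of activity
  # blows past that estimate.
  max_wal_size = 10GB    # proposed GUC, illustrative value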
You think so? I agree with 90% of this paragraph, but it seems to me that
we can find an algorithm that manages this without the idea of
max_wal_size (at least not as a user-settable option).
Along with the above, I don't see any reason for checkpoint_timeout.
Because of the bgwriter we should be able to go more or less indefinitely
without worrying about checkpoints (with a few exceptions such as
pg_start_backup()). Perhaps a setting that causes a checkpoint to happen
based on some non-artificial threshold, rather than a timeout, such as the
amount of data currently in need of a checkpoint?
Either I'm not understanding what you said, or you're confused. The
point of checkpoint_timeout is to put a limit on the time it will take to
recover in case of a crash. The relation between the two,
checkpoint_timeout and how long it will take to recover after a crash,
is not straightforward, but that's the best we have.
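As a very rough illustration only (actual recovery time depends heavily on
the workload and on I/O speed):

  checkpoint_timeout = 5min
  # Crash recovery must replay all WAL written since the redo pointer of the
  # last completed checkpoint, i.e. very roughly up to one or two
  # checkpoint_timeout intervals' worth of WAL in the worst case.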
I may be confused, but it is my understanding that the bgwriter writes out
dirty data from the shared buffer cache based on an interval and a maximum
number of pages written. If we are writing data continuously, do we really
need checkpoints except for special cases (like pg_start_backup())?
Bgwriter does not worry about checkpoints. By "amount of data
currently in need of a checkpoint", do you mean the number of dirty
buffers in shared_buffers, or something else? I don't see how or why
that should affect when you perform a checkpoint.
Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles."
This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?
Because you want to limit the amount of disk space used for WAL. It's
a soft limit, but still.
Why? This is the point that confuses me. Why do we care? We don't care
how much disk space PGDATA takes... why do we all of a sudden care about
pg_xlog?
Perhaps at any given point we reserve 10% of available space (in 16MB
segment increments) for pg_xlog; when you hit that limit, we checkpoint and
LOG EXACTLY WHY.
Ah, but we don't know how much disk space is available. Even if we
did, there might be quotas or other constraints on the amount that we
can actually use. Or the DBA might not want PostgreSQL to use up all
the space, because there are other processes on the same system that
need it.
We could, however, know how much disk space is available.
Sincerely,
JD