[HACKERS] Partitioned checkpointing

Takashi Horikawa Thu, 10 Sep 2015 02:14:11 -0700

Hi All,

Recently, I found a paper titled "Segmented Fuzzy Checkpointing for
Main Memory Databases," published in 1996 at the ACM Symposium on Applied
Computing, which inspired me to implement a similar mechanism in PostgreSQL.
Since the early evaluation results obtained on a 16-core server exceeded
my expectations, I have decided to submit a patch for discussion
by community members interested in this mechanism.


The attached patch is a PoC (or maybe prototype) implementation of
partitioned checkpointing on 9.5alpha2. The term 'partitioned' is used here
instead of 'segmented' because I feel 'segmented' is easily confused with
'xlog segment,' etc. The term 'partitioned' does not have that problem, and
it implies almost the same concept as 'buffer partition,' so I think it is
suitable.

The background and my motivation are that the performance dip caused by
checkpoints is a major concern, and thus it is valuable to mitigate it. In
fact, many countermeasures have been attempted. As far as I know, those
countermeasures have focused mainly on mitigating the adverse impact of the
disk writes that implement the buffer sync; the recent, highly regarded
'checkpointer continuous flushing' is a typical example. On the other hand,
I don't feel that another source of the performance dip has been seriously
addressed: what I call here the full-page-write rush, which would be a major
issue. That is, the average size of transaction log (XLOG) records jumps
sharply immediately after the beginning of each checkpoint, saturating the
WAL write path, including the disk(s) for $PGDATA/pg_xlog and the WAL
buffers.
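
To make the full-page-write rush concrete, here is a toy model, not a
measurement. It assumes uniformly random page updates, and the page count and
the base record size are hypothetical values chosen only for illustration. A
page gets a full-page image on its first modification after the redo point, so
the expected record size starts near one page and decays as more pages have
already been imaged.

```python
# Toy model of the full-page-write (FPW) rush after a checkpoint.
# Assumptions (not from the patch): uniform random page access and
# hypothetical values for the base record size and the number of pages.

PAGE_SIZE = 8192        # default PostgreSQL block size in bytes
BASE_RECORD = 100       # hypothetical size of a WAL record without a page image
NUM_PAGES = 100_000     # hypothetical number of pages being updated


def expected_record_size(updates_since_checkpoint: int) -> float:
    """Expected WAL record size for the next update.

    Under uniform access, the chance that the next page touched has not
    been modified since the checkpoint is (1 - 1/NUM_PAGES) ** k, and only
    such a first touch carries a full-page image.
    """
    p_first_touch = (1.0 - 1.0 / NUM_PAGES) ** updates_since_checkpoint
    return BASE_RECORD + PAGE_SIZE * p_first_touch


if __name__ == "__main__":
    for k in (0, 10_000, 100_000, 500_000, 2_000_000):
        size = expected_record_size(k)
        print(f"{k:>9} updates after checkpoint: ~{size:7.0f} bytes/record")
```

In this model the record size is largest right at the checkpoint and falls
back toward the base size later, which is the shape of the rush described
above.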

In the following, I will describe early evaluation results and mechanism of
the partitioned checkpointing briefly.

1. Performance evaluation
1.1 Experimental setup
The configuration of the server machine was as follows.

CPU: Intel E5-2650 v2 (8 cores/chip) @ 2.60GHz x 2
Memory: 64GB
OS: Linux 2.6.32-504.12.2.el6.x86_64 (CentOS)
Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
         raid1 of 2 SSDs for $PGDATA (other than pg_xlog)

PostgreSQL settings
 shared_buffers = 28GB
 wal_buffers = 64MB
 checkpoint_timeout = 10min
 max_wal_size = 128GB
 min_wal_size = 8GB
 checkpoint_completion_target = 0.9

benchmark
 pgbench -M prepared -N -P 1 -T 3600

The scaling factor was 1000.  Both the number of clients (-c option) and
threads (-j option) were 120 for the sync. commit case and 96 for the async.
commit (synchronous_commit = off) case. These values were chosen because the
maximum throughput was obtained under these conditions.

The server was connected over 1 Gb Ethernet to a client machine on which the
pgbench client program ran. Since the client machine was not saturated
during the measurement and thus hardly affected the results, its details are
not described here.


1.2 Early results
The measurement results shown here are latency average, latency stddev, and
throughput (tps), which are the output of the pgbench program.

1.2.1 synchronous_commit = on
(a) 9.5alpha2(original)
latency average: 2.852 ms
latency stddev: 6.010 ms
tps = 42025.789717 (including connections establishing)
tps = 42026.137247 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.815 ms
latency stddev: 2.317 ms
tps = 42575.301137 (including connections establishing)
tps = 42575.677907 (excluding connections establishing)

1.2.2 synchronous_commit = off
(a) 9.5alpha2(original)
latency average: 2.136 ms
latency stddev: 5.422 ms
tps = 44870.897907 (including connections establishing)
tps = 44871.215792 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.085 ms
latency stddev: 1.529 ms
tps = 45974.617724 (including connections establishing)
tps = 45974.973604 (excluding connections establishing)

1.3 Summary
Partitioned checkpointing produced a large improvement (reduction) in
latency stddev and a slight improvement in latency average and tps; there was
no performance degradation. Partitioned checkpointing therefore has a
stabilizing effect on operation. In fact, the throughput variation reported
by the -P 1 option shows that the dips were mitigated in both magnitude and
frequency.
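
For reference, the relative changes can be computed directly from the pgbench
figures in section 1.2. The snippet below is only arithmetic on those reported
numbers; the dictionary layout and function names are illustrative.

```python
# Relative changes computed from the pgbench figures reported in section 1.2.
# tps values are the "including connections establishing" figures.

results = {
    "synchronous_commit = on": {
        "original":    {"lat_avg": 2.852, "lat_stddev": 6.010, "tps": 42025.789717},
        "partitioned": {"lat_avg": 2.815, "lat_stddev": 2.317, "tps": 42575.301137},
    },
    "synchronous_commit = off": {
        "original":    {"lat_avg": 2.136, "lat_stddev": 5.422, "tps": 44870.897907},
        "partitioned": {"lat_avg": 2.085, "lat_stddev": 1.529, "tps": 45974.617724},
    },
}


def percent_change(before: float, after: float) -> float:
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


def summarize(case: str) -> dict:
    """Percent change of each metric for one commit mode."""
    orig = results[case]["original"]
    part = results[case]["partitioned"]
    return {metric: percent_change(orig[metric], part[metric]) for metric in orig}


if __name__ == "__main__":
    for case in results:
        print(case)
        for metric, change in summarize(case).items():
            print(f"  {metric:<10} {change:+.1f}%")
```

Latency stddev drops by about 61% with synchronous_commit = on and about 72%
with synchronous_commit = off, while tps rises by about 1.3% and 2.5%
respectively.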

# Since I'm not sure whether it is OK to send an email to this mailing list
with files other than a patch attached, I refrain for now from attaching the
raw results (200K bytes of text per case) and the graphs in .jpg or .epsf
format that illustrate the throughput variations. If it is OK, I would be
pleased to share the results in those formats.


2. Mechanism
As the name suggests, 'partitioned checkpointing' conducts buffer sync not
for all buffers at once but for the buffers belonging to one partition per
invocation of the checkpointer. In the following description, the number of
partitions is denoted by N. (N is fixed to 16 in the attached patch.)

2.1 Principles of operations
In order to preserve the semantics of traditional checkpointing, the
checkpointer invocation interval is changed to checkpoint_timeout / N. The
checkpointer carries out the buffer sync for buffer partition 0 at the
first invocation, then for buffer partition 1 at the second invocation, and
so on. When the turn of buffer partition N-1 comes, i.e. the last round of a
series of buffer syncs, the checkpointer carries out the buffer sync for
that partition together with the other usual checkpoint operations, coded in
CheckPointGuts() in xlog.c.

The principle is that, roughly speaking, 1) checkpointing for buffer
partition 0 corresponds to the beginning of traditional checkpointing,
where the XLOG location (LSN) is obtained and set to RedoRecPtr, and 2)
checkpointing for buffer partition N - 1 corresponds to the end of
traditional checkpointing, where the WAL files that are no longer needed (up
to the log segment preceding the one specified by the RedoRecPtr value) are
deleted or recycled.
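
The schedule described so far can be sketched as follows. This is only an
illustration of the timing, using checkpoint_timeout = 10min from the
experimental setup; the function and field names are made up for this sketch
and do not come from the patch.

```python
# Sketch of one cycle of checkpointer invocations as described in section 2.1.
# Names are illustrative; nothing here is taken from the attached patch.

N_PARTITIONS = 16            # N is fixed to 16 in the attached patch
CHECKPOINT_TIMEOUT = 600.0   # checkpoint_timeout = 10min, in seconds


def checkpoint_schedule(n: int = N_PARTITIONS, timeout: float = CHECKPOINT_TIMEOUT):
    """Return the invocation interval and the steps of one full cycle."""
    interval = timeout / n
    steps = []
    for partition in range(n):
        steps.append({
            # Offset of this invocation from the start of the cycle.
            "time": partition * interval,
            "partition": partition,
            # Partition 0 corresponds to the beginning of a traditional
            # checkpoint: the redo location is obtained here.
            "sets_redo_ptr": partition == 0,
            # Partition N-1 corresponds to the end: the remaining checkpoint
            # work runs and old WAL segments are deleted or recycled.
            "finishes_checkpoint": partition == n - 1,
        })
    return interval, steps


if __name__ == "__main__":
    interval, steps = checkpoint_schedule()
    print(f"invocation interval: {interval} s")
    for step in steps:
        print(step)
```

With these settings the checkpointer would run every 37.5 seconds, and a
complete pass over all 16 partitions still spans one checkpoint_timeout.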

The role of RedoRecPtr as the threshold that determines whether an FPW is
necessary is moved to a new N-element array of XLogRecPtr, because the
threshold differs from partition to partition. The n-th element of
the array is updated when the buffer sync for partition n is carried out.
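
A minimal sketch of the per-partition threshold follows. Two things in it are
assumptions rather than facts from this mail: the mapping of a buffer to a
partition (buffer_id % N here, purely hypothetical) and the class and method
names. The comparison itself follows the usual rule that a page needs a
full-page image when its LSN is not newer than the redo point.

```python
# Sketch of per-partition FPW thresholds (section 2.1).
# ASSUMPTION: buffers map to partitions by buffer_id % N. The mail does not
# say how the patch assigns buffers to partitions.

N_PARTITIONS = 16


class PartitionedRedo:
    def __init__(self, n: int = N_PARTITIONS):
        self.n = n
        # Stands in for the new N-element array of XLogRecPtr.
        self.redo_ptr = [0] * n

    def partition_of(self, buffer_id: int) -> int:
        return buffer_id % self.n   # hypothetical mapping

    def start_partition_sync(self, partition: int, current_lsn: int) -> None:
        # The n-th element is updated when partition n is synced.
        self.redo_ptr[partition] = current_lsn

    def needs_full_page_write(self, buffer_id: int, page_lsn: int) -> bool:
        # Compare against this buffer's own partition threshold instead of
        # a single global RedoRecPtr.
        return page_lsn <= self.redo_ptr[self.partition_of(buffer_id)]


if __name__ == "__main__":
    redo = PartitionedRedo()
    redo.start_partition_sync(0, current_lsn=1000)
    print(redo.needs_full_page_write(buffer_id=16, page_lsn=900))   # partition 0
    print(redo.needs_full_page_write(buffer_id=17, page_lsn=900))   # partition 1
```

Because only one element of the array advances per invocation, only the pages
of that partition become eligible for new full-page images at that moment,
which spreads the FPW load over the checkpoint interval.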

2.2 Drawbacks
The 'partitioned checkpointing' works effectively when the checkpointer is
invoked by reaching checkpoint_timeout; the performance dip is mitigated
and the WAL size is unchanged (on average).

On the other hand, when the checkpointer is invoked by a trigger other than
the timeout, the traditional checkpoint procedure, which syncs all buffers at
once, takes place, resulting in a performance dip. Also, the WAL size for
that checkpoint period (until the next invocation of the checkpointer) will
theoretically increase to 1.5 times that of the usual case because of the
increase in FPWs.

My opinion is that this is not serious, because it is preferable for the
checkpointer to be invoked by the timeout, and thus typical systems are
supposed to be tuned to work under the conditions that are preferable for
'partitioned checkpointing.'


3. Conclusion
The 'partitioned checkpointing' mechanism is expected to be effective for
mitigating the performance dip due to checkpoints. In particular, it is
noteworthy that the effect was observed on a server machine that uses SSDs
for $PGDATA, for which seek optimizations are not believed to be effective.
Therefore, this mechanism is worth further investigation with the aim of
implementing it in a future PostgreSQL release.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories

partitioned-checkpointing.patch
Description: Binary data
