Hi All,

Recently, I found a paper titled "Segmented Fuzzy Checkpointing for Main Memory Databases," published at the 1996 ACM Symposium on Applied Computing, which inspired me to implement a similar mechanism in PostgreSQL. Since the early evaluation results obtained on a 16-core server were beyond my expectations, I have decided to submit a patch, to open it for discussion by community members interested in this mechanism.
The attached patch is a PoC (or maybe prototype) implementation of partitioned checkpointing on 9.5alpha2. The term 'partitioned' is used here instead of 'segmented' because 'segmented' is easily confused with 'xlog segment,' etc. In contrast, 'partitioned' implies almost the same concept as 'buffer partition,' so I think it is suitable.

The background and my motivation: the performance dip due to checkpoints is a major concern, and it is therefore valuable to mitigate this issue. In fact, many countermeasures have been attempted against it. As far as I know, those countermeasures so far focus mainly on mitigating the adverse impact of the disk writes that implement the buffer sync; the recent, highly regarded 'checkpointer continuous flushing' is a typical example. On the other hand, I don't feel that another source of the performance dip has been seriously addressed: what I call here the full-page-write (FPW) rush. That is, the average size of transaction log (XLOG) records jumps up sharply immediately after the beginning of each checkpoint, saturating the WAL write path, including the disk(s) for $PGDATA/pg_xlog and the WAL buffers.

In the following, I briefly describe early evaluation results and the mechanism of partitioned checkpointing.

1. Performance evaluation

1.1 Experimental setup

The configuration of the server machine was as follows.

CPU: Intel E5-2650 v2 (8 cores/chip) @ 2.60GHz x 2
Memory: 64GB
OS: Linux 2.6.32-504.12.2.el6.x86_64 (CentOS)
Storage: RAID1 of 4 HDDs (write-back assumed, using BBU) for $PGDATA/pg_xlog
         RAID1 of 2 SSDs for $PGDATA (other than pg_xlog)

PostgreSQL settings:

shared_buffers = 28GB
wal_buffers = 64MB
checkpoint_timeout = 10min
max_wal_size = 128GB
min_wal_size = 8GB
checkpoint_completion_target = 0.9

Benchmark:

pgbench -M prepared -N -P 1 -T 3600

The scaling factor was 1000. Both the number of clients (-c option) and threads (-j option) were 120 for the sync. commit case and 96 for the async. commit (synchronous_commit = off) case. These values were chosen because the maximum throughputs were obtained under these conditions. The server was connected over 1G Ethernet to a client machine on which the pgbench client program ran. Since the client machine was not saturated during the measurement and thus hardly affected the results, its details are not described here.

1.2 Early results

The measurement results shown here are latency average, latency stddev, and throughput (tps), as output by the pgbench program.

1.2.1 synchronous_commit = on

(a) 9.5alpha2 (original)
latency average: 2.852 ms
latency stddev: 6.010 ms
tps = 42025.789717 (including connections establishing)
tps = 42026.137247 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.815 ms
latency stddev: 2.317 ms
tps = 42575.301137 (including connections establishing)
tps = 42575.677907 (excluding connections establishing)

1.2.2 synchronous_commit = off

(a) 9.5alpha2 (original)
latency average: 2.136 ms
latency stddev: 5.422 ms
tps = 44870.897907 (including connections establishing)
tps = 44871.215792 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.085 ms
latency stddev: 1.529 ms
tps = 45974.617724 (including connections establishing)
tps = 45974.973604 (excluding connections establishing)

1.3 Summary

Partitioned checkpointing produced a great improvement (reduction) in latency stddev and slight improvements in latency average and tps; there was no performance degradation. Therefore, partitioned checkpointing has a stabilizing effect on operation. In fact, the throughput variation, obtained with the -P 1 option, shows that the dips were mitigated in both magnitude and frequency.
# Since I'm not sure whether it is OK to send attachments other than the patch to this mailing list, I refrain for now from attaching the raw results (200K bytes of text per case) and the result graphs (in .jpg or .epsf format) illustrating the throughput variations. If it is OK, I will be pleased to show the results in those formats.

2. Mechanism

As the name suggests, 'partitioned checkpointing' conducts the buffer sync not for all buffers at once, but only for the buffers belonging to one partition at each invocation of the checkpointer. In the following description, the number of partitions is expressed by N (N is fixed to 16 in the attached patch).

2.1 Principles of operation

In order to preserve the semantics of traditional checkpointing, the checkpointer invocation interval is changed to checkpoint_timeout / N. The checkpointer carries out the buffer sync for buffer partition 0 at the first invocation, for buffer partition 1 at the second invocation, and so on. When the turn of buffer partition N-1 comes, i.e. the last round of a series of buffer syncs, the checkpointer carries out the buffer sync for that partition together with the other usual checkpoint operations, coded in CheckPointGuts() in xlog.c.

The principle is that, roughly speaking, 1) the checkpointing for buffer partition 0 corresponds to the beginning of a traditional checkpoint, where the XLOG location (LSN) is obtained and set to RedoRecPtr, and 2) the checkpointing for buffer partition N-1 corresponds to the end of a traditional checkpoint, where the WAL files that are no longer needed (up to the log segment preceding the one specified by the RedoRecPtr value) are deleted or recycled.

The role of RedoRecPtr as the threshold that determines whether an FPW is necessary is moved to a new N-element array of XLogRecPtr, since the threshold differs among partitions. The n-th element of the array is updated when the buffer sync for partition n is carried out.
2.2 Drawbacks

Partitioned checkpointing works effectively in the situation where the checkpointer is invoked by hitting checkpoint_timeout; the performance dip is mitigated and the WAL size is unchanged (on average). On the other hand, when the checkpointer is invoked by a trigger event other than the timeout, the traditional checkpoint procedure, which syncs all buffers at once, takes place, resulting in a performance dip. Also, the WAL size for that checkpoint period (until the next invocation of the checkpointer) will theoretically increase to 1.5 times that of the usual case because of the increase in FPWs. My opinion is that this is not serious, because it is preferable for the checkpointer to be invoked by the timeout anyway, and thus usual systems are supposed to be tuned to operate under conditions that are favorable for partitioned checkpointing.

3. Conclusion

The 'partitioned checkpointing' mechanism is expected to be effective in mitigating the performance dip due to checkpoints. In particular, it is noteworthy that the effect was observed on a server machine that uses SSDs for $PGDATA, for which seek optimizations are not believed to be effective. Therefore, this mechanism is worth further investigation, aiming at implementation in a future version of PostgreSQL.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories