Hello pg-devs,
This patch is a simplified and generalized version of Andres Freund's August 2014 patch for flushing while writing during checkpoints, with some documentation and configuration warnings added.
For the initial patch, see: http://www.postgresql.org/message-id/20140827091922.gd21...@awork2.anarazel.de
For the whole thread: http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, and in general to get better latency-bound performance.
Flushes are issued along with pg's throttled writes, instead of waiting for the checkpointer's final "fsync", which induces occasional stalls. From "pgbench -P 1 ...", such stalls look like this:

  progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043     # ok
  progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828   # stalled
  progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346    # ...
  progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964    # restart
  progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326     # ok

I've seen similar behavior on FreeBSD with its native FS, so this is not a Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first, based on "sync_file_range", is Linux-specific; the other relies on "posix_fadvise". The tests below ran on Linux. If someone could test the posix_fadvise version on relevant platforms, that would be great...

The Linux-specific "sync_file_range" approach was suggested, among other ideas, by Theodore Ts'o on Robert Haas's blog in March 2014:
http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html

Two guc variables control whether the feature is activated for writes of dirty pages issued by the checkpointer and the bgwriter. Given that the settings may improve or degrade performance, having GUCs seems justified. In particular, the stalling issue disappears with SSDs.
The effect is significant on the series of tests shown below: scale-10 pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and checkpoint_timeout=30s, unless stated otherwise.
Note: I know that this checkpoint_timeout is too small for a normal config, but the point is to test how checkpoints behave, so the test triggers as many checkpoints as possible, hence the minimum timeout setting. I have also done some tests with a larger timeout.
(1) THROTTLED PGBENCH

The objective of the patch is to reduce the latency of transactions under a moderate load. This first series of tests focuses on that point with the help of pgbench -R (rate) and -L (skip/count late transactions). The measure counts transactions which were skipped or beyond the expected latency limit while targeting a transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during 100 seconds, latency limit 100 ms), over 256 runs, 7 hours per case:

   flush     | percent of skipped
  cp  | bgw  | & out of latency limit transactions
  off | off  | 6.5 %
  off | on   | 6.1 %
  on  | off  | 0.4 %
  on  | on   | 0.4 %

* Same as above (100 tps target) over one run of 4000 seconds with shared_buffers=256MB and checkpoint_timeout=10mn:

   flush     | percent of skipped
  cp  | bgw  | & out of latency limit transactions
  off | off  | 1.3 %
  off | on   | 1.5 %
  on  | off  | 0.6 %
  on  | on   | 0.6 %

* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

   flush     | percent of skipped
  cp  | bgw  | & out of latency limit transactions
  off | off  | 8.0 %
  off | on   | 8.0 %
  on  | off  | 0.4 %
  on  | on   | 0.4 %

* Same as above (150 tps target) over one run of 4000 seconds with shared_buffers=256MB and checkpoint_timeout=10mn:

   flush     | percent of skipped
  cp  | bgw  | & out of latency limit transactions
  off | off  | 1.7 %
  off | on   | 1.9 %
  on  | off  | 0.7 %
  on  | on   | 0.6 %

Turning "checkpoint_flush_to_disk = on" significantly reduces the number of late transactions. These late transactions are not uniformly distributed; rather, they are clustered around times when pg is stalled, i.e. more or less unresponsive.

bgwriter_flush_to_disk does not seem to have a significant impact on these tests, maybe because pg's shared_buffers size is much larger than the database, so the bgwriter is seldom active.
(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the impact of these options on tps figures and their variability.
* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case:

   flush     | performance on ...
  cp  | bgw  | 512 100-second runs  | 1s intervals (over 51200 seconds)
  off | off  | 691 +- 36 tps        | 691 +- 236 tps
  off | on   | 677 +- 29 tps        | 677 +- 230 tps
  on  | off  | 655 +- 23 tps        | 655 +- 130 tps
  on  | on   | 657 +- 22 tps        | 657 +- 130 tps

On this first test, setting checkpoint_flush_to_disk reduces performance by 5%, but the per-second standard deviation is nearly halved: performance is more stable over the runs, although lower. The effect of bgwriter_flush_to_disk is inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with checkpoint_timeout=10mn and shared_buffers=256MB (at least 6 checkpoints during the run, probably more because segments are filled more often than every 10mn):

   flush     | performance ... (stddev over per-second tps)
  cp  | bgw  |
  off | off  | 877 +- 179 tps
  off | on   | 880 +- 183 tps
  on  | off  | 896 +- 131 tps
  on  | on   | 888 +- 132 tps

On this second test, setting checkpoint_flush_to_disk seems to slightly improve performance (maybe 2%?) and significantly reduces variability, so it looks like a good move.
* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients):

   flush     | performance on ...
  cp  | bgw  | 32 100-second runs  | 1s intervals (over 3200 seconds)
  off | off  | 1970 +- 60 tps      | 1970 +- 783 tps
  off | on   | 1928 +- 61 tps      | 1928 +- 813 tps
  on  | off  | 1578 +- 45 tps      | 1578 +- 631 tps
  on  | on   | 1594 +- 47 tps      | 1594 +- 618 tps

On this test, both the average and the standard deviation are reduced by 20%. This does not look like a win.

CONCLUSION

This approach is simple and significantly improves pg's fsync behavior under moderate load, where the database stays mostly responsive. Under full load, the situation may be improved or degraded; it depends.

OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with the Linux io-scheduler priority (ioprio_set) is only relevant with the "cfq" scheduler on actual hard disks; it does not work with other schedulers, especially "deadline", which seems more advisable for pg, nor with hardware RAID, which is a common setting.
Also, Theodore Ts'o suggested using "sync_file_range" to check whether the writes have reached the disk, and possibly delaying the actual fsync/checkpoint conclusion if not... I have not tried that; the implementation is not as trivial, and I'm not sure what to do when the completion target is approaching, but it could be an interesting option to investigate. Preliminary tests that added a sleep between the writes and the final fsync did not yield very good results.
I've also played with numerous other options (changing checkpointer throttling parameters, reducing checkpoint timeout to 1 second, playing around with various kernel settings), but none of them seemed very effective for the problem at hand.
I also attached a test script I used, that can be adapted if someone wants to collect some performance data. I also have some basic scripts to extract and compute stats, ask if needed.
-- Fabien.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549b7d..1c0a3a1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1818,6 +1818,24 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <variablelist>
+     <varlistentry id="guc-bgwriter-flush-to-disk" xreflabel="bgwriter_flush_to_disk">
+      <term><varname>bgwriter_flush_to_disk</varname> (<type>bool</type>)
+      <indexterm>
+       <primary><varname>bgwriter_flush_to_disk</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When the bgwriter writes data, hint the underlying OS that the data
+        must be sent to disk as soon as possible.  This may help smooth
+        disk IO writes and avoid a stall when an fsync is issued by a
+        checkpoint, but it may also reduce average performance.
+        This setting may have no effect on some platforms.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bgwriter-lru-maxpages" xreflabel="bgwriter_lru_maxpages">
       <term><varname>bgwriter_lru_maxpages</varname> (<type>integer</type>)
       <indexterm>
@@ -2495,6 +2513,23 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+      <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+      <indexterm>
+       <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When writing data for a checkpoint, hint the underlying OS that the
+        data must be sent to disk as soon as possible.  This may help smooth
+        disk IO writes and avoid a stall when fsync is issued at the end of
+        the checkpoint, but it may also reduce average performance.
+        This setting may have no effect on some platforms.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
       <term><varname>min_wal_size</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..cdbdca9 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,15 @@
   </para>
 
   <para>
+   On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+   allows hinting the OS that pages written on checkpoints must be flushed
+   to disk quickly.  Otherwise, these pages may be kept in cache for some time,
+   inducing a stall later when <literal>fsync</> is called to actually
+   complete the checkpoint.  This setting helps to reduce transaction latency,
+   but it also has an adverse effect on the average transaction rate.
+  </para>
+
+  <para>
    The number of WAL segment files in <filename>pg_xlog</> directory depends on
    <varname>min_wal_size</>, <varname>max_wal_size</> and the amount of WAL
    generated in previous checkpoint cycles.  When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..2d5c873 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		 * Note that we deviate from the usual WAL coding practices here,
 		 * check the above "Logical rewrite support" comment for reasoning.
 		 */
-		written = FileWrite(src->vfd, waldata_start, len);
+		written = FileWrite(src->vfd, waldata_start, len, false);
 		if (written != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..3375032 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
 	/* Write the page.  If archiving/streaming, XLOG it. */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+			  (char *) metapage, true, false);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 					BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ae8c1ca 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		/* overwriting a block we zero-filled before */
 		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
-				  (char *) page, true);
+				  (char *) page, true, false);
 	}
 
 	pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..242af8f 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 	/* Write the page.  If archiving/streaming, XLOG it. */
 	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+			  (char *) page, true, false);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 					SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 
 	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+			  (char *) page, true, false);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 					SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 
 	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+			  (char *) page, true, false);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 					SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..3e19ebc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,9 @@
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
 double		bgwriter_lru_multiplier = 2.0;
 bool		track_io_timing = false;
+/* hint to move writes to high priority */
+bool		checkpoint_flush_to_disk = false;
+bool		bgwriter_flush_to_disk = false;
 
 /*
  * How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +399,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
+						  bool flush_to_disk);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
 static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +413,7 @@
 static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
 			bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1022,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 											smgr->smgr_rnode.node.dbNode,
 											smgr->smgr_rnode.node.relNode);
 
-				FlushBuffer(buf, NULL);
+				FlushBuffer(buf, NULL, false);
 				LWLockRelease(buf->content_lock);
 
 				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1662,7 +1666,7 @@ BufferSync(int flags)
 		 */
 		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk) & BUF_WRITTEN)
 			{
 				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 				BgWriterStats.m_buf_written_checkpoints++;
@@ -1939,7 +1943,7 @@ BgBufferSync(void)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+		int			buffer_state = SyncOneBuffer(next_to_clean, true, bgwriter_flush_to_disk);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -2016,7 +2020,7 @@ BgBufferSync(void)
  * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk)
 {
 	volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
@@ -2057,7 +2061,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 	PinBuffer_Locked(bufHdr);
 	LWLockAcquire(bufHdr->content_lock, LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, flush_to_disk);
 
 	LWLockRelease(bufHdr->content_lock);
 	UnpinBuffer(bufHdr, true);
@@ -2319,9 +2323,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
 */
 static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 			  buf->tag.forkNum,
 			  buf->tag.blockNum,
 			  bufToWrite,
-			  false);
+			  false,
+			  flush_to_disk);
 
 	if (track_io_timing)
 	{
@@ -2830,6 +2838,7 @@ FlushRelationBuffers(Relation rel)
 						  bufHdr->tag.forkNum,
 						  bufHdr->tag.blockNum,
 						  localpage,
+						  false,
 						  false);
 
 				bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +2873,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(bufHdr->content_lock, LW_SHARED);
-			FlushBuffer(bufHdr, rel->rd_smgr);
+			FlushBuffer(bufHdr, rel->rd_smgr, false);
 			LWLockRelease(bufHdr->content_lock);
 			UnpinBuffer(bufHdr, true);
 		}
@@ -2916,7 +2925,7 @@ FlushDatabaseBuffers(Oid dbid)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(bufHdr->content_lock, LW_SHARED);
-			FlushBuffer(bufHdr, NULL);
+			FlushBuffer(bufHdr, NULL, false);
 			LWLockRelease(bufHdr->content_lock);
 			UnpinBuffer(bufHdr, true);
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..156539d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,6 +208,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 				  bufHdr->tag.forkNum,
 				  bufHdr->tag.blockNum,
 				  localpage,
+				  false,
 				  false);
 
 		/* Mark not-dirty now in case we error out below */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..132cc43 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
 				return;			/* seek failed, give up */
 			file->offsets[file->curFile] = file->curOffset;
 		}
-		bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+		bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false);
 		if (bytestowrite <= 0)
 			return;				/* failed to write */
 		file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..717e772 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1345,7 +1345,7 @@ retry:
 }
 
 int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk)
 {
 	int			returnCode;
 
@@ -1395,6 +1395,55 @@ retry:
 
 	if (returnCode >= 0)
 	{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+		/*
+		 * Calling "write" tells the OS that pg wants to write some page to disk,
+		 * however when it is really done is chosen by the OS.
+		 * Depending on other disk activities this may be delayed significantly,
+		 * maybe up to an "fsync" call, which could induce an IO write surge.
+		 * When checkpointing, pg is doing its own throttling and the result
+		 * should really be written to disk with high priority, so as to meet
+		 * the completion target.
+		 * This call hints that such writes have a higher priority.
+		 */
+		if (flush_to_disk && returnCode == amount && errno == 0)
+		{
+			int			rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+			/* Linux: tell the memory manager to move these blocks to io so
+			 * that they are considered for being actually written to disk.
+			 */
+			rc = sync_file_range(VfdCache[file].fd, VfdCache[file].seekPos,
+								 amount, SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+			/* Others: say that data should not be kept in memory...
+			 * This is not exactly what we want to say, because we want to write
+			 * the data for durability but we may need it later nevertheless.
+			 * It seems that Linux would free the memory *if* the data has
+			 * already been written to disk, else it is ignored.
+			 * For FreeBSD this may have the desired effect of moving the
+			 * data to the io layer.
+			 */
+			rc = posix_fadvise(VfdCache[file].fd, VfdCache[file].seekPos,
+							   amount, POSIX_FADV_DONTNEED);
+
+#endif
+
+			if (rc < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not flush block " INT64_FORMAT " in file \"%s\": %m",
+								(int64) VfdCache[file].seekPos / BLCKSZ,
+								VfdCache[file].fileName)));
+		}
+#endif   /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
 		VfdCache[file].seekPos += returnCode;
 
 		/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..5c50e19 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errmsg("could not seek to block %u in file \"%s\": %m",
 						blocknum, FilePathName(v->mdfd_vfd))));
 
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false)) != BLCKSZ)
 	{
 		if (nbytes < 0)
 			ereport(ERROR,
@@ -738,7 +738,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 */
 void
 mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		char *buffer, bool skipFsync)
+		char *buffer, bool skipFsync, bool flush_to_disk)
 {
 	off_t		seekpos;
 	int			nbytes;
@@ -767,7 +767,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errmsg("could not seek to block %u in file \"%s\": %m",
 						blocknum, FilePathName(v->mdfd_vfd))));
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..199695d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 										  BlockNumber blocknum, char *buffer);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-									 BlockNumber blocknum, char *buffer, bool skipFsync);
+									 BlockNumber blocknum, char *buffer, bool skipFsync,
+									 bool flush_to_disk);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 									   BlockNumber nblocks);
@@ -643,10 +644,10 @@
 */
 void
 smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  char *buffer, bool skipFsync)
+		  char *buffer, bool skipFsync, bool flush_to_disk)
 {
 	(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
-											  buffer, skipFsync);
+											  buffer, skipFsync, flush_to_disk);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3c9f14..0b5ca17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
 static bool check_ssl(bool *newval, void **extra, GucSource source);
 static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
 static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
 static bool check_canonical_path(char **newval, void **extra, GucSource source);
 static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
 static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -569,6 +570,8 @@ const char *const config_group_names[] =
 	gettext_noop("Write-Ahead Log / Checkpoints"),
 	/* WAL_ARCHIVING */
 	gettext_noop("Write-Ahead Log / Archiving"),
+	/* BGWRITER */
+	gettext_noop("Background Writer"),
 	/* REPLICATION */
 	gettext_noop("Replication"),
 	/* REPLICATION_SENDING */
@@ -1009,6 +1012,27 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+
+	{
+		{"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+			gettext_noop("Hint that checkpoint's writes are high priority."),
+			NULL
+		},
+		&checkpoint_flush_to_disk,
+		false,
+		check_flush_to_disk, NULL, NULL
+	},
+
+	{
+		{"bgwriter_flush_to_disk", PGC_SIGHUP, BGWRITER,
+			gettext_noop("Hint that bgwriter's writes are high priority."),
+			NULL
+		},
+		&bgwriter_flush_to_disk,
+		false,
+		check_flush_to_disk, NULL, NULL
+	},
+
 	{
 		{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
 			gettext_noop("Logs each successful connection."),
@@ -9761,6 +9785,22 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 }
 
 static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+	/* just warn if it has no effect */
+	ereport(WARNING,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("Setting \"checkpoint_flush_to_disk\" or "
+					"\"bgwriter_flush_to_disk\" has no effect "
+					"on this platform.")));
+#endif   /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+	return true;
+}
+
+static bool
 check_canonical_path(char **newval, void **extra, GucSource source)
 {
 	/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..4fea196 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,8 @@ extern int	bgwriter_lru_maxpages;
 extern double bgwriter_lru_multiplier;
 extern bool track_io_timing;
 extern int	target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern bool bgwriter_flush_to_disk;
 
 /* in buf_init.c */
 extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..32ac80f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -70,7 +70,7 @@ extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount);
 extern int	FileRead(File file, char *buffer, int amount);
-extern int	FileWrite(File file, char *buffer, int amount);
+extern int	FileWrite(File file, char *buffer, int amount, bool flush_to_disk);
 extern int	FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
 extern int	FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..0bf0886 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,7 +95,7 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-		  BlockNumber blocknum, char *buffer, bool skipFsync);
+		  BlockNumber blocknum, char *buffer, bool skipFsync, bool flush_to_disk);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
@@ -121,7 +121,7 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   char *buffer);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-		BlockNumber blocknum, char *buffer, bool skipFsync);
+		BlockNumber blocknum, char *buffer, bool skipFsync, bool flush_to_disk);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..b69af2d 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -68,6 +68,7 @@ enum config_group
 	WAL_SETTINGS,
 	WAL_CHECKPOINTS,
 	WAL_ARCHIVING,
+	BGWRITER,
 	REPLICATION,
 	REPLICATION_SENDING,
 	REPLICATION_MASTER,
cp_test.sh
Description: Bourne shell script
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers