The attached patch adds a new field to pg_stat_bgwriter, counting the number of times backends execute their own fsync calls. Normally, when a backend needs to fsync data, it passes a request to the background writer, which absorbs the request into its own queue of pending work. However, under some types of heavy system load, that queue can fill. When that happens, backends are forced to do their own fsync calls, which is potentially much worse than doing a regular write.

The really nasty situation is when the background writer is busy because it's executing a checkpoint. In that case, it's possible for the backend fsync calls to start competing with the ones the background writer is trying to get done, causing the checkpoint's sync phase to run much slower than it should. I've seen the sync phase take over 45 minutes on a really busy server once it got into this condition, with hundreds of clients doing their own backend fsync calls fighting against the checkpoint fsync work. With this patch, you can observe that happening as an upward spike in pg_stat_bgwriter.buffers_backend_sync, which, as documented, counts a subset of the writes already included in the buffers_backend total.
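
If you want to watch for that spike, a minimal monitoring query against the view might look like the sketch below; sample it periodically and diff or graph the results (buffers_backend_sync is the new column added here):

-- Snapshot the backend write and backend sync counters; comparing
-- successive samples shows when backends start doing their own fsyncs
SELECT now() AS sample_time,
       buffers_backend,
       buffers_backend_sync
FROM pg_stat_bgwriter;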

While it takes a busier system than I can usefully simulate here to show a really bad situation, I'm able to see some of these unabsorbed backend fsync calls when initializing a pgbench database, which is enough to prove they happen in the lab. The attached test script takes a pgbench scale factor as its input. It then creates a pgbench database (deleting any existing pgbench database, so watch out for that) and shows the values accumulated in pg_stat_bgwriter during that period. Here's an example, using the script's default scale of 100, on a server with 8GB of RAM and fake fsync (the hard drives are lying about it):

-[ RECORD 1 ]--------+-----------------------------
now                  | 2010-11-14 16:08:41.36421-05
...
Initializing pgbench
-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:09:46.713693-05
checkpoints_timed    | 0
checkpoints_req      | 0
buffers_checkpoint   | 0
buffers_clean        | 0
maxwritten_clean     | 0
buffers_backend      | 654716
buffers_backend_sync | 90
buffers_alloc        | 803

This is with default sizing for the memory structures involved. As you increase shared_buffers, the fsync request queue involved here grows proportionally, making it less likely you'll run into this problem. That just turns it into the kind of problem I've only seen on a larger system with a difficult-to-simulate workload. The production system getting hammered by this problem (running a web application), which is what prompted writing the patch, had shared_buffers=4GB at the time.
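
To give a rough idea of the scale involved: the queue in question is sized to hold about one fsync request per shared buffer, so a sketch like the following (which just reads the setting, nothing the patch itself exposes) approximates how many requests can be pending before backends fall back to doing their own sync:

-- shared_buffers is reported here as a count of 8kB buffers, and the
-- request queue gets roughly one slot per buffer, so this approximates
-- how many fsync requests can queue up before the fallback path is hit
SELECT setting::bigint AS approx_fsync_queue_slots
FROM pg_settings
WHERE name = 'shared_buffers';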

The patch also adds some logging to the innards involved here, to help with understanding problems in this area. I don't think that should be in the version committed as-is; we may want to drop the logging level, or disable it in regular builds, since it sits somewhere that generates a lot of log data and adds overhead. It is nice for now, though, as it lets you get an idea how much fsync work *is* being absorbed by the BGW, as well as showing which relation is suffering from this issue. Here's an example of both of those things, with the default configuration for everything except log_checkpoints (on) and log_min_messages (debug1):

DEBUG:  Absorbing 4096 fsync requests
DEBUG:  Absorbing 150 fsync requests
DEBUG:  Unable to forward fsync request, executing directly
CONTEXT:  writing block 158638 of relation base/16385/16398

Here 4096 is the maximum number of entries the BGW will ever absorb at once, and all 90 of the missed sync calls are logged so you can see which files they came from.
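
Since the CONTEXT line gives the file as base/<database oid>/<relfilenode>, you can usually map it back to a table name; for the example above, something like this, run while connected to the affected database, should do it:

-- 16398 is the relfilenode from the CONTEXT line in the log excerpt above
SELECT relname
FROM pg_class
WHERE relfilenode = 16398;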

As some high-level commentary on this patch: I'm not sure what most end users will ever do with this data. Then again, I wasn't sure what a typical end user would do with anything else in pg_stat_bgwriter when it was added either, and it turns out the answer is "wait for people who know the internals to write things that monitor it". For example, Magnus has updated recent versions of the Munin plug-in for PostgreSQL to usefully graph pg_stat_bgwriter data over time. Since I have some data suggesting checkpoint problems on Linux in particular are getting worse as total system memory increases, I expect having an easy way to instrument for this particular problem will be correspondingly more important in the future.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dbf966b..f701b8f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -264,8 +264,10 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
       and the number of times the background writer stopped a cleaning scan
       because it had written too many buffers.  Also includes
       statistics about the shared buffer pool, including buffers written
-      by backends (that is, not by the background writer) and total buffers
-      allocated.
+      by backends (that is, not by the background writer), how many times
+      those backends had to execute their own filesystem synchronization
+      calls (normally the background writer handles those even when the
+      backend does its own write), and total buffers allocated.
      </entry>
      </row>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 09574c3..8766ccc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -537,6 +537,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
         pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
         pg_stat_get_buf_written_backend() AS buffers_backend,
+        pg_stat_get_buf_sync_backend() AS buffers_backend_sync,
         pg_stat_get_buf_alloc() AS buffers_alloc;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0690ab5..43a149e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -102,13 +102,15 @@
  *
  * num_backend_writes is used to count the number of buffer writes performed
  * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.
+ * can't overflow during a single bgwriter cycle.  num_backend_fsync
+ * counts the subset of those writes where the backend also had to do its
+ * own fsync, because the background writer failed to absorb the request.
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the bgwriter.
  *
- * Unlike the checkpoint fields, num_backend_writes and the requests
- * fields are protected by BgWriterCommLock.
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
  *----------
  */
 typedef struct
@@ -132,6 +134,7 @@ typedef struct
 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
 
 	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
+	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
 
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
@@ -1084,12 +1087,14 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 
 	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
 
-	/* we count non-bgwriter writes even when the request queue overflows */
+	/* Count all backend writes, regardless of whether they fit in the queue */
 	BgWriterShmem->num_backend_writes++;
 
 	if (BgWriterShmem->bgwriter_pid == 0 ||
 		BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
 	{
+		/* Also count the subset where backends have to do their own fsync */
+		BgWriterShmem->num_backend_fsync++;
 		LWLockRelease(BgWriterCommLock);
 		return false;
 	}
@@ -1137,7 +1142,10 @@ AbsorbFsyncRequests(void)
 
 	/* Transfer write count into pending pgstats message */
 	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
 	BgWriterShmem->num_backend_writes = 0;
+	BgWriterShmem->num_backend_fsync = 0;
 
 	n = BgWriterShmem->num_requests;
 	if (n > 0)
@@ -1149,6 +1157,9 @@ AbsorbFsyncRequests(void)
 
 	LWLockRelease(BgWriterCommLock);
 
+	if (n > 0)
+		elog(DEBUG1, "Absorbing %d fsync requests", n);
+
 	for (request = requests; n > 0; request++, n--)
 		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a1ad342..c3c136a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4188,6 +4188,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_clean += msg->m_buf_written_clean;
 	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
+	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
 }
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c72da72..7140b94 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1269,6 +1269,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		if (ForwardFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno))
 			return;				/* passed it off successfully */
 
+		elog(DEBUG1, "Unable to forward fsync request, executing directly");
+
 		if (FileSync(seg->mdfd_vfd) < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6edb8bf..2104b53 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -78,6 +78,7 @@ extern Datum pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_sync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
@@ -1165,6 +1166,12 @@ pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_sync_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+}
+
+Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 4f444ae..7fc6f05 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3117,6 +3117,8 @@ DATA(insert OID = 2773 ( pg_stat_get_bgwriter_maxwritten_clean PGNSP PGUID 12 1
 DESCR("statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning");
 DATA(insert OID = 2775 ( pg_stat_get_buf_written_backend PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_written_backend _null_ _null_ _null_ ));
 DESCR("statistics: number of buffers written by backends");
+DATA(insert OID = 3063 ( pg_stat_get_buf_sync_backend PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_sync_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of backend buffer writes that did their own sync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d49bd80..42bf9c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -360,6 +360,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_clean;
 	PgStat_Counter m_maxwritten_clean;
 	PgStat_Counter m_buf_written_backend;
+	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
 } PgStat_MsgBgWriter;
 
@@ -562,6 +563,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_clean;
 	PgStat_Counter maxwritten_clean;
 	PgStat_Counter buf_written_backend;
+	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
 } PgStat_GlobalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7a796a..32cda64 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1293,7 +1293,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
  pg_stat_activity            | SELECT s.datid, d.datname, s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, s.xact_start, s.query_start, s.waiting, s.current_query FROM pg_database d, pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u WHERE ((s.datid = d.oid) AND (s.usesysid = u.oid));
  pg_stat_all_indexes         | SELECT c.oid AS relid, i.oid AS indexrelid, n.nspname AS schemaname, c.relname, i.relname AS indexrelname, pg_stat_get_numscans(i.oid) AS idx_scan, pg_stat_get_tuples_returned(i.oid) AS idx_tup_read, pg_stat_get_tuples_fetched(i.oid) AS idx_tup_fetch FROM (((pg_class c JOIN pg_index x ON ((c.oid = x.indrelid))) JOIN pg_class i ON ((i.oid = x.indexrelid))) LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace))) WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char"]));
  pg_stat_all_tables          | SELECT c.oid AS relid, n.nspname AS schemaname, c.relname, pg_stat_get_numscans(c.oid) AS seq_scan, pg_stat_get_tuples_returned(c.oid) AS seq_tup_read, (sum(pg_stat_get_numscans(i.indexrelid)))::bigint AS idx_scan, ((sum(pg_stat_get_tuples_fetched(i.indexrelid)))::bigint + pg_stat_get_tuples_fetched(c.oid)) AS idx_tup_fetch, pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins, pg_stat_get_tuples_updated(c.oid) AS n_tup_upd, pg_stat_get_tuples_deleted(c.oid) AS n_tup_del, pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd, pg_stat_get_live_tuples(c.oid) AS n_live_tup, pg_stat_get_dead_tuples(c.oid) AS n_dead_tup, pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum, pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum, pg_stat_get_last_analyze_time(c.oid) AS last_analyze, pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze, pg_stat_get_vacuum_count(c.oid) AS vacuum_count, pg_stat_get_autovacuum_count(c.oid) AS autovacuum_count, pg_stat_get_analyze_count(c.oid) AS analyze_count, pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count FROM ((pg_class c LEFT JOIN pg_index i ON ((c.oid = i.indrelid))) LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace))) WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char"])) GROUP BY c.oid, n.nspname, c.relname;
- pg_stat_bgwriter            | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_alloc() AS buffers_alloc;
+ pg_stat_bgwriter            | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_sync_backend() AS buffers_backend_sync, pg_stat_get_buf_alloc() AS buffers_alloc;
  pg_stat_database            | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted FROM pg_database d;
  pg_stat_sys_indexes         | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
  pg_stat_sys_tables          | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));

Attachment: fsync-stress-v2.sh
Description: Bourne shell script
