On Mon, Nov 26, 2012 at 02:43:19PM -0500, Bruce Momjian wrote:
> > >> In any event, I think the documentation should caution that the
> > >> upgrade should not be deemed to be a success until after a system-wide
> > >> sync has been done.  Even if we use the link rather than copy method,
> > >> are we sure that that is safe if the directories recording those links
> > >> have not been fsynced?
> > >
> > > OK, the above is something I have been thinking about, and obviously you
> > > have too.  If you change fsync from off to on in a cluster, and restart
> > > it, there is no guarantee that the dirty pages you read from the kernel
> > > are actually on disk, because Postgres doesn't know they are dirty.
> > > They probably will be pushed to disk by the kernel in less than one
> > > minute, but still, it doesn't seem reliable. Should this be documented
> > > in the fsync section?
> > >
> > > Again, another reason not to use fsync=off, though your example of the
> > > file copy is a good one.  As you stated, this is a problem with the file
> > > copy/link, independent of how Postgres handles the files.  We can tell
> > > people to use 'sync' as root on Unix, but what about Windows?
> > 
> > I'm pretty sure someone mentioned the way to do that on Windows in
> > this list in the last few months, but I can't seem to find it.  I
> > thought it was the initdb fsync thread.
> 
> Yep, the code is already in initdb to fsync a directory --- we just need
> a way for pg_upgrade to access it.

I have developed the attached patch that does this.  It basically adds
an --sync-only option to initdb, then turns off all durability in
pg_upgrade and has pg_upgrade run initdb --sync-only;  this give us
another nice speedup!

                 ------ SSD ---- -- magnetic ---
                    git    patch    git    patch
            1      11.11   11.11   11.10   11.13
         1000      20.57   19.89   20.72   19.30
         2000      28.02   25.81   28.50   27.53
         4000      42.00   43.59   46.71   46.84
         8000      89.66   74.16   89.10   73.67
        16000     157.66  135.98  159.97  153.48
        32000     316.24  296.90  334.74  308.59
        64000     814.97  715.53  797.34  727.94

(I am very happy with these times.  Thanks to Jeff Janes for his
suggestions.)

I have also added documentation to the 'fsync' configuration variable
warning about dirty buffers and recommending flushing them to disk
before the cluster is crash-recovery safe.

I consider this patch ready for 9.3 application (meaning it is not a
prototype).

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
new file mode 100644
index c12f15b..63df529
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
*************** main(int argc, char **argv)
*** 150,155 ****
--- 150,161 ----
  			  new_cluster.pgdata);
  	check_ok();
  
+ 	prep_status("Sync data directory to disk");
+ 	exec_prog(UTILITY_LOG_FILE, NULL, true,
+ 			  "\"%s/initdb\" --sync-only \"%s\"", new_cluster.bindir,
+ 			  new_cluster.pgdata);
+ 	check_ok();
+ 
  	create_script_for_cluster_analyze(&analyze_script_file_name);
  	create_script_for_old_cluster_deletion(&deletion_script_file_name);
  
diff --git a/contrib/pg_upgrade/server.c b/contrib/pg_upgrade/server.c
new file mode 100644
index 49d4c8f..05d8cc0
*** a/contrib/pg_upgrade/server.c
--- b/contrib/pg_upgrade/server.c
*************** start_postmaster(ClusterInfo *cluster)
*** 209,217 ****
  	 * a gap of 2000000000 from the current xid counter, so autovacuum will
  	 * not touch them.
  	 *
! 	 *	synchronous_commit=off improves object creation speed, and we only
! 	 *	modify the new cluster, so only use it there.  If there is a crash,
! 	 *	the new cluster has to be recreated anyway.
  	 */
  	snprintf(cmd, sizeof(cmd),
  			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
--- 209,217 ----
  	 * a gap of 2000000000 from the current xid counter, so autovacuum will
  	 * not touch them.
  	 *
! 	 * Turn off durability requirements to improve object creation speed, and
! 	 * we only modify the new cluster, so only use it there.  If there is a
! 	 * crash, the new cluster has to be recreated anyway.
  	 */
  	snprintf(cmd, sizeof(cmd),
  			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
*************** start_postmaster(ClusterInfo *cluster)
*** 219,225 ****
  			 (cluster->controldata.cat_ver >=
  			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
  			 " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
! 			 (cluster == &new_cluster) ? " -c synchronous_commit=off" : "",
  			 cluster->pgopts ? cluster->pgopts : "", socket_string);
  
  	/*
--- 219,226 ----
  			 (cluster->controldata.cat_ver >=
  			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
  			 " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
! 			 (cluster == &new_cluster) ?
! 				" -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "",
  			 cluster->pgopts ? cluster->pgopts : "", socket_string);
  
  	/*
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index b56070b..b7df8ce
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** include 'filename'
*** 1697,1702 ****
--- 1697,1711 ----
         </para>
  
         <para>
+         For reliable recovery when changing <varname>fsync</varname>
+         off to on, it is necessary to force all modified buffers in the
+         kernel to durable storage.  This can be done while the cluster
+         is shutdown or while fsync is on by running <command>initdb
+         --sync-only</command>, running <command>sync</>, unmounting the
+         file system, or rebooting the server.
+        </para>
+ 
+        <para>
          In many situations, turning off <xref linkend="guc-synchronous-commit">
          for noncritical transactions can provide much of the potential
          performance benefit of turning off <varname>fsync</varname>, without
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
new file mode 100644
index 08ee37e..a1e46eb
*** a/doc/src/sgml/ref/initdb.sgml
--- b/doc/src/sgml/ref/initdb.sgml
*************** PostgreSQL documentation
*** 245,250 ****
--- 245,261 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-S</option></term>
+       <term><option>--sync-only</option></term>
+       <listitem>
+        <para>
+         Safely write all database files to disk and exit.  This does not
+         perform any of the normal <application>initdb</> operations.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-T <replaceable>CFG</></option></term>
        <term><option>--text-search-config=<replaceable>CFG</></option></term>
        <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
new file mode 100644
index 402504b..8c0a9f4
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
*************** static const char *authmethodlocal = "";
*** 118,123 ****
--- 118,124 ----
  static bool debug = false;
  static bool noclean = false;
  static bool do_sync = true;
+ static bool sync_only = false;
  static bool show_setting = false;
  static char *xlog_dir = "";
  
*************** usage(const char *progname)
*** 2796,2801 ****
--- 2797,2803 ----
  	printf(_("  -n, --noclean             do not clean up after errors\n"));
  	printf(_("  -N, --nosync              do not wait for changes to be written safely to disk\n"));
  	printf(_("  -s, --show                show internal settings\n"));
+ 	printf(_("  -S, --sync-only           only sync data directory\n"));
  	printf(_("\nOther options:\n"));
  	printf(_("  -V, --version             output version information, then exit\n"));
  	printf(_("  -?, --help                show this help, then exit\n"));
*************** main(int argc, char *argv[])
*** 3445,3450 ****
--- 3447,3453 ----
  		{"show", no_argument, NULL, 's'},
  		{"noclean", no_argument, NULL, 'n'},
  		{"nosync", no_argument, NULL, 'N'},
+ 		{"sync-only", no_argument, NULL, 'S'},
  		{"xlogdir", required_argument, NULL, 'X'},
  		{NULL, 0, NULL, 0}
  	};
*************** main(int argc, char *argv[])
*** 3476,3482 ****
  
  	/* process command-line options */
  
! 	while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sT:X:", long_options, &option_index)) != -1)
  	{
  		switch (c)
  		{
--- 3479,3485 ----
  
  	/* process command-line options */
  
! 	while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sST:X:", long_options, &option_index)) != -1)
  	{
  		switch (c)
  		{
*************** main(int argc, char *argv[])
*** 3522,3527 ****
--- 3525,3533 ----
  			case 'N':
  				do_sync = false;
  				break;
+ 			case 'S':
+ 				sync_only = true;
+ 				break;
  			case 'L':
  				share_path = pg_strdup(optarg);
  				break;
*************** main(int argc, char *argv[])
*** 3589,3594 ****
--- 3595,3608 ----
  		exit(1);
  	}
  
+ 	/* If we only need to fsync, just to it and exit */
+ 	if (sync_only)
+ 	{
+ 		setup_pgdata();
+ 		perform_fsync();
+ 		return 0;
+ 	}
+ 	
  	if (pwprompt && pwfilename)
  	{
  		fprintf(stderr, _("%s: password prompt and password file cannot be specified together\n"), progname);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to