Hi, I'd like to help with the topic in the Subject: line.  It seems to be a    
TODO item.  I've reviewed some threads discussing the matter, so I hope I've    
acquired enough history concerning it.  I've taken an initial swipe at    
figuring out how to optimize sync'ing methods.  It's based largely on   
recommendations I've read on previous threads about fsync/O_SYNC and so on.    
After reviewing, if anybody has recommendations on how to proceed then I'd   
love to hear them.  
Attached is a little program that basically does a bunch of sequential writes   
to a file.  All of the sync'ing methods supported by PostgreSQL WAL can be   
used.  Results are printed in microseconds.  Size and quantity of writes are
configurable.  The documentation is in the code (how to configure, build, run,   
etc.).  I realize that this program doesn't reflect all of the possible   
activities of a production database system, but I hope it's a step in the   
right direction for this task.  I've used it to see differences in behavior   
between the various sync'ing methods on various platforms.   
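Since everything is reported in microseconds, here is a minimal sketch of how an elapsed time can be derived from two gettimeofday() samples.  The helper name elapsed_usec is made up for illustration and is not taken from the attached program.  Subtracting the seconds before scaling is just one way to keep the intermediate values small on platforms where long is 32 bits.

```c
#include <sys/time.h>

/* Elapsed microseconds between two gettimeofday() samples.
   elapsed_usec is a hypothetical helper name, used here only for
   illustration.  The seconds are subtracted first so that the value
   being multiplied by 1000000 stays small. */
long elapsed_usec(const struct timeval *before, const struct timeval *after)
{
    return (long) (after->tv_sec - before->tv_sec) * 1000000L
         + (long) (after->tv_usec - before->tv_usec);
}
```

A caller would take one sample just before the write loop and one just after it, then pass both to elapsed_usec().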
Here's what I've found running the benchmark on some systems to which  
I have access.  The differences in behavior between platforms are quite vast.
Summary first...   
PostgreSQL should be run on an old Apple Macintosh attached to
its own Hitachi disk array with 2GB cache or so.  Use any sync method   
except for fsync().   
Anyway, there is *a lot* of variance in file syncing behavior across
different hardware and O/S platforms.  It's probably not safe   
to conclude much.  That said, here are some findings so far based on   
tests I've run:  
1.  Under no circumstances do fsync() or fdatasync() seem to perform
better than opening files with O_SYNC or O_DSYNC.
2.  Where there are differences, opening files with O_SYNC or O_DSYNC
tends to be considerably faster.
3.  fsync() seems to be the slowest where there are differences, and
O_DSYNC seems to be the fastest where results differ.
4.  The safest thing to assert at this point is that
Solaris systems ought to use the O_DSYNC method for WAL.
Test system(s):
Athlon Linux:
    AMD Athlon XP2000, 512MB RAM, single (5400 or 7200?) RPM 20GB IDE disk,
    reiserfs filesystem (3 something I think)
    SuSE Linux kernel 2.4.21-99
Mac Linux:
    I don't know the specific model.  400MHz G3, 512MB, single IDE disk,
    ext2 filesystem
    Debian GNU/Linux 2.4.16-powerpc
HP Intel Linux:
    HP ProLiant DL380 G3, 2 x 3GHz Xeon, 2GB RAM, SmartArray 5i 64MB cache,
    2 x 15,000RPM 36GB U320 SCSI drives mirrored.  I'm not sure if
    writes are cached or not.  There's no battery backup.
    ext3 filesystem.
    Red Hat Enterprise Linux 3.0 kernel based on 2.4.21
Dell Intel OpenBSD:
    PowerEdge ?, single 1GHz PIII, 128MB RAM, single 7200RPM 80GB IDE disk,
    ffs filesystem
    OpenBSD 3.2 GENERIC kernel
SUN Ultra2:
    Ultra2, 2 x 296MHz UltraSPARC II, 2GB RAM, 2 x 10,000RPM 18GB U160
    SCSI drives mirrored with Solstice DiskSuite.  UFS filesystem.
    Solaris 8.
SUN E4500 + HDS Thunder 9570v:
    E4500, 8 x 400MHz UltraSPARC II, 3GB RAM,
    HDS Thunder 9570v, 2GB mirrored battery-backed cache, RAID5 with a
    bunch of 146GB 10,000RPM FC drives.  LUN is on single 2GB FC fabric
    Veritas filesystem (VxFS)
    Solaris 8.
Test methodology:   
All test runs were done with CHUNKSIZE 8 * 1024, CHUNKS 2 * 1024,   
FILESIZE_MULTIPLIER 2, and SLEEP 5.  So a total of 16MB was sequentially  
written for each benchmark.  
Results are in microseconds.   
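The 16MB figure follows directly from the settings above.  As a small worked example, the arithmetic can be sketched like this (the function names bytes_written and file_size are made up for illustration):

```c
/* Bytes written during the timed run: `chunks` writes of `chunksize`
   bytes each.  Hypothetical helper, for illustration only. */
long bytes_written(long chunksize, long chunks)
{
    return chunksize * chunks;
}

/* Size of the pre-populated file: the multiplier times the amount of
   data written during the timed run. */
long file_size(long multiplier, long chunksize, long chunks)
{
    return multiplier * chunksize * chunks;
}
```

With the values used here, bytes_written(8 * 1024, 2 * 1024) is 16777216 (16MB) and file_size(2, 8 * 1024, 2 * 1024) is 33554432 (32MB), so each timed run overwrites the first half of an already populated file rather than extending it.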
PLATFORM:       Athlon Linux   
buffered:       48220   
fsync:          74854397   
fdatasync:      75061357   
open_sync:      73869239   
open_datasync:  74748145   
Notes:  System mostly idle.  Even during tests, top showed about 95%   
idle.  Something's not right on this box.  All sync methods similarly   
horrible on this system.   
PLATFORM:       Mac Linux   
buffered:       58912   
fsync:          1539079   
fdatasync:      769058   
open_sync:      767094   
open_datasync:  763074   
Notes: system mostly idle.  fsync seems worst.  Otherwise, they seem   
pretty equivalent.  This is the fastest system tested.  
PLATFORM:       HP Intel Linux   
buffered:       33026   
fsync:          29330067   
fdatasync:      28673880   
open_sync:      8783417   
open_datasync:  8747971   
Notes: system idle.  O_SYNC and O_DSYNC methods seem to be a lot   
better on this platform than fsync & fdatasync.  
PLATFORM:       Dell Intel OpenBSD  
buffered:       511890  
fsync:          1769190  
fdatasync:      --------  
open_sync:      1748764  
open_datasync:  1747433  
Notes: system idle.  I couldn't locate fdatasync() on this box, so I  
couldn't test it.  All sync methods seem equivalent and are very fast --  
though they still trail the old Mac.
PLATFORM:       SUN Ultra2  
buffered:       1814824  
fsync:          73954800  
fdatasync:      52594532  
open_sync:      34405585  
open_datasync:  13883758  
Notes:  system mostly idle, with occasional spikes from 1-10% utilization.
There appears to be a substantial difference between each sync method, with
O_DSYNC the best and fsync() the worst.  There is also a substantial
difference between the open* and f* methods.
PLATFORM:       SUN E4500 + HDS Thunder 9570v  
buffered:       233947  
fsync:          57802065  
fdatasync:      56631013  
open_sync:      2362207  
open_datasync:  1976057  
Notes:  host about 30% idle, but the array tested on was completely idle.  
Something looks seriously not right about fsync and fdatasync -- write  
cache seems to have no effect on them.  As for write cache, that  
probably explains the 2 seconds or so for the open_sync and  
open_datasync methods.  
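The raw totals are easier to compare when reduced to a per-write cost.  Here is a small sketch of that arithmetic, assuming each run performs CHUNKS (2048) writes of 8KB for 16MB in total, as described in the methodology.  The helper names are hypothetical.

```c
/* Average microseconds per write() for a run of `writes` writes.
   Hypothetical helper, for illustration only. */
double usec_per_write(long elapsed_usec, long writes)
{
    return (double) elapsed_usec / (double) writes;
}

/* Throughput in megabytes (2^20 bytes) per second for a run that wrote
   `bytes` bytes in `elapsed_usec` microseconds. */
double mb_per_sec(long elapsed_usec, long bytes)
{
    double megabytes = (double) bytes / (1024.0 * 1024.0);
    double seconds = (double) elapsed_usec / 1000000.0;

    return megabytes / seconds;
}
```

For example, on the HP Intel Linux box, open_sync works out to roughly 4289 usec per 8KB write (about 1.8MB/s), against roughly 14321 usec per write for fsync.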
Thanks for reading...I look forward to feedback, and hope to be helpful in  
this effort! 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <sys/time.h>


/*
syncbench: a program to help determine optimal i/o synchronization
written by Mark Travis

Copyright (c) 2004 Mark Travis

Permission to use, copy, modify, and distribute this software and
its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both that copyright notice and this permission
notice appear in supporting documentation, and that the name of the
author not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission. The author makes no representations about the
suitability of this software for any purpose.  It is provided "as
is" without express or implied warranty.


To help "Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options"
from the PostgreSQL TODO list (http://developer.postgresql.org/todo.php)


Writes sequential chunks of data to a file and synchronizes with the
I/O device in a variety of ways.  Gives results in microseconds.


There are 6 things which must be defined in order for this thing to

FILE_SYNC and OPEN_SYNC define how file contents are to be sync'd, and
each platform has various options.  FILE_SYNC defines the function
call made to sync after each write().  Most platforms should support fsync(2)
at a minimum.  fdatasync(2) should be faster on platforms which support
it.

OPEN_SYNC defines the flag to be set in open(2) for
synchronizing all writes without explicitly calling fsync(2) or its
kind.  Most platforms should support O_SYNC at a minimum.  O_DSYNC should be
faster on platforms which support it.  And some environments have
neither.  I could only find O_FSYNC on FreeBSD 4.10.  Anyway, reading man
pages for open(2), fsync(2), and fdatasync(2) is probably a good idea
before setting FILE_SYNC and OPEN_SYNC.
*/


#define FILE_SYNC(X)    fdatasync(X)
#define OPEN_SYNC       O_DSYNC


/*
SLEEP.  CHUNKSIZE is the size of each chunk of data written in the test.
CHUNKS is the number of times they are written.  All writes are
sequential.  Before the test is run, a file is created, filled with
"A" characters, fsync(2)'d, and then closed.  Then we sleep for SLEEP
seconds before proceeding with the actual benchmark run.  The sleep
is to let the I/O device quiesce if it wants to.

The size of the file created is FILESIZE_MULTIPLIER times CHUNKSIZE
times CHUNKS.  So if FILESIZE_MULTIPLIER exceeds 1 then the file will
be bigger than the amount of data written to it.  Having the file size
equal to or exceeding the amount of data written should help to simulate
real-world WAL behavior.  Extending the size of a file requires extra
work for the filesystem to perform.  To learn the impact of that, go
ahead and set FILESIZE_MULTIPLIER to less than 1 if you want.
*/


#define CHUNKSIZE		(8 * 1024)
#define CHUNKS			(2 * 1024)
#define FILESIZE_MULTIPLIER	2
#define SLEEP			5


/*
Nothing else should need to be modified from this point, but please
keep reading for building and running instructions.

It should be simple as long as the right tools are in your $PATH (cc,

Linux and *BSD seem to be happy with this:
cc syncbench.c -o syncbench

Solaris 8 requires -lrt in order to support fdatasync:
cc syncbench.c -lrt -o syncbench

Other O/S's you'll have to figure out on your own.

It takes two arguments.  The first is the name of the file to use for
the test.  If the file doesn't exist already, it will be created.  If
it does exist already, then it will be truncated before writing.

The second argument is the mode in which file data is synchronized to
disk.  It can be one of buffered, filesync and opensync.  buffered
means no syncing takes place.  Unless your filesystem is mounted
synchronously this should be the fastest option by far.  filesync
executes the function defined by FILE_SYNC above after each chunk of
data is written.  opensync open()'s the file with the flag defined in
OPEN_SYNC.

PostgreSQL supports 5 methods of sync'ing if you count "fsync=false".  These
methods are in postgresql.conf under the "WRITE AHEAD LOG" section.

Here's how to use this tool to try and simulate those methods:

PostgreSQL Method:	Syncbench
fsync=false		run with 2nd argument=buffered
fsync			#define FILE_SYNC(X)	fsync(X)
			2nd argument=filesync
fdatasync		#define FILE_SYNC(X)	fdatasync(X)
			2nd argument=filesync
open_sync		#define OPEN_SYNC	O_SYNC
			2nd argument=opensync
open_datasync		#define OPEN_SYNC	O_DSYNC
			2nd argument=opensync

Obviously, if the platform doesn't support those things for whatever
reason then the program won't build.


The number of microseconds between writes starting and ending is
displayed.  Time taken to pre-populate, open(), or close() the
datafile is not included in this calculation.  Obviously, the results
will vary based on the amount of data written, synchronization
options used, hardware, O/S, filesystem, etc.  It's a good idea to try
to run this on systems which are as idle as possible -- especially the
I/O device(s) being tested.
*/

int main(int argc, char *argv[])
{
  int fd, n, bm_openflag, bm_do_filesync;
  char *buf;
  struct timeval tv_before, tv_after;
  struct timezone tz_garbage;

  if ( argc != 3 ) {
    printf("usage: %s <filename> <buffered|filesync|opensync>\n", argv[0] );
    exit(1);
  }

  /* Pick the open(2) flag and whether FILE_SYNC is called after each write */
  if ( !strcmp( argv[2], "buffered" ) ) {
    bm_openflag = 0;
    bm_do_filesync = 0;
  } else if ( !strcmp( argv[2], "filesync" ) ) {
    bm_openflag = 0;
    bm_do_filesync = 1;
  } else if ( !strcmp( argv[2], "opensync" ) ) {
    bm_openflag = OPEN_SYNC;
    bm_do_filesync = 0;
  } else {
    puts("Second argument must be one of buffered filesync opensync");
    exit(1);
  }
  printf("test uses %s\n", argv[2]);

  printf("Starting test with FILESIZE: %ld, CHUNKSIZE: %i, CHUNKS: %i\n",
    (long) (FILESIZE_MULTIPLIER * CHUNKSIZE * CHUNKS), CHUNKSIZE, CHUNKS);

  fd = open( argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0666 );
  if ( fd == -1 ) {
    perror("Can't create data file");
    exit(1);
  }

  buf = malloc(CHUNKSIZE);
  if ( buf == NULL ) {
    puts("malloc choked for some reason.  Bye!");
    exit(1);
  }

  /*
     Make sure that the whole file is not made up of NULLs.
     I seem to recall a characteristic of *NIX filesystems that likes
     to not populate NULL-filled files with actual blocks full of NULLs.
     This may cause actual population of the file to cause an update
     of metadata, which might cause performance issues.  Anyway,
     pre-populate the file with "A" to hedge against that.
     It may be desirable to pre-populate just like WAL is pre-populated.
     Or maybe this doesn't matter.
  */
  memset(buf, 65, CHUNKSIZE);
  for ( n=0; n < (int) (FILESIZE_MULTIPLIER * CHUNKS); n++ ) {
    if ( write(fd, buf, CHUNKSIZE) == -1 ) {
      perror("Can't write to data file");
      exit(1);
    }
  }

  if ( fsync(fd) ) {
    perror("Can't fsync data file");
    exit(1);
  }

  if ( close(fd) ) {
    perror("Can't close data file");
    exit(1);
  }

  /* sleep for grins.  so maybe the device slows down */
  sleep(SLEEP);

  /* OK it's created.  Now open it again then start the show */
  fd = open( argv[1], O_WRONLY | bm_openflag, 0666 );
  if ( fd == -1 ) {
    perror("Can't open data file for writing");
    exit(1);
  }
  memset(buf, 66, CHUNKSIZE);

  /* Only the write loop is timed */
  gettimeofday( &tv_before, &tz_garbage );
  for ( n=0; n<CHUNKS; n++) {
    if ( write(fd, buf, CHUNKSIZE) == -1 ) {
      perror("Can't write to existing data file");
      exit(1);
    }
    if ( bm_do_filesync ) {
      if ( FILE_SYNC(fd) ) {
        perror("Can't sync data file");
        exit(1);
      }
    }
  }
  gettimeofday( &tv_after, &tz_garbage );

  /* Subtract seconds before scaling so the result fits in a long */
  printf("Elapsed usec: %ld\n",
    (long) (tv_after.tv_sec - tv_before.tv_sec) * 1000000L +
    (long) (tv_after.tv_usec - tv_before.tv_usec) );

  close(fd);
  free(buf);
  return 0;
}

