On 6/30/13 2:01 PM, Jeff Davis wrote:
Simple test program attached, which creates two files and fills them:
one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
I just cmp the resulting files (and also "ls" them, to make sure they
are 16MB).

This makes platform level testing a lot easier, thanks. Attached is an updated copy of that program with some error checking. If the files it creates already existed, the code didn't notice, and a series of write errors happened. If you set the test up right it's not a problem, but it's better if a bad setup is caught. I wrapped the whole test with a shell script, also attached, which insures the right test sequence and checks.

Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X Darwin. No surprises there, there's a long list of platforms that don't support this call at https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html and the Mac is on it. Many other platforms I was worried about don't support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so that cuts down on testing quite a bit. If it runs faster on Linux, that's the main target here, just like the existing effective_io_concurrency fadvise code.

The specific thing I was worried about is that this interface might have a stub that doesn't work perfectly in older Linux kernels. After being surprised to find this interface worked on RHEL5 with your test program, I dug into this more. It works there, but it may actually be slower.

posix_fallocate is actually implemented by glibc on Linux. Been there since 2.1.94 according to the Linux man pages. But Linux itself didn't add the feature until kernel 2.6.20: http://lwn.net/Articles/226436/ The biggest thing I was worried about--the call might be there in early kernels but with a non-functional implementation--that's not the case. Looking at the diff, before that patch there's no fallocate at all.

So what happened in earlier kernels, where there was no kernel level fallocate available? According to https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html what glibc does is check for kernel fallocate(), and if it's not there it writes a bunch of zeros to create the file instead. What is actually happening on a RHEL5 system (with kernel 2.6.18) is that calling posix_fallocate does this fallback behavior, where it basically does the same thing the existing WAL clearing code does.

I can even prove that's the case. On RHEL5, if you run "strace -o out ./fallocate" the main write loop looks like this:

write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192

But when you call posix_fallocate, you still get a bunch of writes, but 4 bytes at a time:

pwrite(4, "\0", 1, 16769023)            = 1
pwrite(4, "\0", 1, 16773119)            = 1
pwrite(4, "\0", 1, 16777215)            = 1

That's glibc helpfully converting your call to posix_fallocate into small writes, because the OS doesn't provide a better way in that kernel. It's not hard to imagine this being slower than what the WAL code is doing right now. I'm not worried about correctness issues anymore, but my gut paranoia about this not working as expected on older systems was justified. Everyone who thought I was just whining owes me a cookie.

This is what I plan to benchmark specifically next. If the posix_fallocate approach is actually slower than what's done now when it's not getting kernel acceleration, which is the case on RHEL5 era kernels, we might need to make the configure time test more complicated. Whether posix_fallocate is defined isn't sensitive enough; on Linux it may be the case that this only is usable when fallocate() is also there.

Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
rm -f fallocate /tmp/afile /tmp/bfile
gcc fallocate.c -o fallocate
if [ ! -x fallocate ] ; then
  echo Test program did not compile, posix_fallocate may not be supported

if [ -f /tmp/afile ] ; then
  sizea=`du /tmp/afile | cut -f 1`
  sizeb=`du /tmp/bfile | cut -f 1`
  if [ "$sizea" -eq "$sizeb" ] ; then
    cmp /tmp/afile /tmp/bfile
    if [ "$?" -ne 0 ] ; then
      echo Test failed, files do not match
      echo Test passed
    echo Test failed, sizes do not match
#include <fcntl.h>
#include <stdio.h>

char buf[8192] = {0};

int main()
	int i;
	int written;
	int fda = open("/tmp/afile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	int fdb = open("/tmp/bfile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fda < 0 || fdb < 0)
		printf("Opening files failed\n");
	for(i = 0; i < 2048; i++)
			written=write(fda, buf, 8192);
			if (written < 8192)
				printf("Write to file failed");

	posix_fallocate(fdb, 0, 16*1024*1024);


	return 0;
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to