Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

Greg Smith Sun, 30 Jun 2013 15:56:38 -0700

On 6/30/13 2:01 PM, Jeff Davis wrote:

> Simple test program attached, which creates two files and fills them:
> one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
> I just cmp the resulting files (and also "ls" them, to make sure they
> are 16MB).

This makes platform level testing a lot easier, thanks. Attached is an
updated copy of that program with some error checking. If the files it
creates already existed, the code didn't notice, and a series of write
errors happened. If you set the test up right it's not a problem, but
it's better if a bad setup is caught. I wrapped the whole test with a
shell script, also attached, which ensures the right test sequence and
checks.

Your C test program compiles and passes on RHEL5/6 here, but doesn't on
OS X Darwin. No surprises there; there's a long list of platforms that
don't support this call at
https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html
and the Mac is on it. Many other platforms I was worried about don't
support it either--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so
that cuts down on testing quite a bit. If it runs faster on Linux,
that's the main target here, just like the existing
effective_io_concurrency fadvise code.

The specific thing I was worried about is that this interface might have
a stub that doesn't work perfectly in older Linux kernels. After being
surprised to find this interface worked on RHEL5 with your test program,
I dug into this more. It works there, but it may actually be slower.

posix_fallocate is actually implemented by glibc on Linux. Been there
since 2.1.94 according to the Linux man pages. But Linux itself didn't
ship the fallocate() system call until kernel 2.6.23; the patch is
described at http://lwn.net/Articles/226436/
The biggest thing I was worried about--the call might be there in early
kernels but with a non-functional implementation--that's not the case.
Looking at the diff, before that patch there's no fallocate at all.

So what happened in earlier kernels, where there was no kernel level
fallocate available? According to
https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html
what glibc does is check for kernel fallocate(), and if it's not there
it writes a bunch of zeros to create the file instead. What is actually
happening on a RHEL5 system (with kernel 2.6.18) is that calling
posix_fallocate does this fallback behavior, where it basically does the
same thing the existing WAL clearing code does.

I can even prove that's the case. On RHEL5, if you run
"strace -o out ./fallocate" the main write loop looks like this:

write(3,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192) = 8192

But when you call posix_fallocate, you still get a bunch of writes, but
only 1 byte each, one at the end of every 4096-byte block:


pwrite(4, "\0", 1, 16769023)            = 1
pwrite(4, "\0", 1, 16773119)            = 1
pwrite(4, "\0", 1, 16777215)            = 1

That's glibc helpfully converting your call to posix_fallocate into
small writes, because the OS doesn't provide a better way in that
kernel. It's not hard to imagine this being slower than what the WAL
code is doing right now. I'm not worried about correctness issues
anymore, but my gut paranoia about this not working as expected on older
systems was justified. Everyone who thought I was just whining owes me
a cookie.

This is what I plan to benchmark specifically next. If the
posix_fallocate approach is actually slower than what's done now when
it's not getting kernel acceleration, which is the case on RHEL5 era
kernels, we might need to make the configure time test more complicated.
Whether posix_fallocate is defined isn't sensitive enough; on Linux it
may be the case that this is only usable when fallocate() is also there.
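
One possible shape for that, sketched here as a run-time check rather
than a configure test: call the Linux-specific fallocate() directly and
fall back to ordinary zero-filling writes when the kernel or filesystem
reports no support. This is an assumption about how such a check could
look, not anything from the proposed patch. The helper names
try_native_fallocate, zero_fill and preallocate are hypothetical.

```c
#define _GNU_SOURCE			/* for the Linux-specific fallocate() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the kernel allocated the space,
 * 0 if native fallocate is unavailable here, -1 on any other error.
 */
int
try_native_fallocate(int fd, off_t len)
{
	if (fallocate(fd, 0, 0, len) == 0)
		return 1;
	if (errno == ENOSYS || errno == EOPNOTSUPP)
		return 0;
	return -1;
}

/* Hypothetical helper: fill the first "len" bytes with 8KB zero writes. */
int
zero_fill(int fd, off_t len)
{
	static const char zeros[8192];
	off_t		done = 0;

	while (done < len)
	{
		size_t		chunk = sizeof(zeros);
		ssize_t		n;

		if (len - done < (off_t) sizeof(zeros))
			chunk = (size_t) (len - done);
		n = pwrite(fd, zeros, chunk, done);
		if (n < 0 && errno == EINTR)
			continue;			/* interrupted, retry the same chunk */
		if (n <= 0)
			return -1;
		done += n;				/* short writes just advance less */
	}
	return 0;
}

/* Use the kernel path when it exists, otherwise write zeros ourselves. */
int
preallocate(int fd, off_t len)
{
	int			rc = try_native_fallocate(fd, len);

	if (rc == 1)
		return 0;
	if (rc == 0)
		return zero_fill(fd, len);
	return -1;
}
```

The glibc wrapper for fallocate() only appeared in glibc 2.10, so on an
older glibc this would not compile at all, which is roughly the signal
a stricter configure test could key on.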


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#!/bin/sh
# Build and run the posix_fallocate test, then compare the two files.
rm -f fallocate /tmp/afile /tmp/bfile
gcc fallocate.c -o fallocate
if [ ! -x fallocate ] ; then
  echo "Test program did not compile, posix_fallocate may not be supported"
  exit 1
fi

./fallocate
if [ -f /tmp/afile ] && [ -f /tmp/bfile ] ; then
  # du reports disk usage (allocated blocks), not the apparent size
  sizea=`du /tmp/afile | cut -f 1`
  sizeb=`du /tmp/bfile | cut -f 1`
  if [ "$sizea" -eq "$sizeb" ] ; then
    if cmp /tmp/afile /tmp/bfile ; then
      echo "Test passed"
    else
      echo "Test failed, files do not match"
    fi
  else
    echo "Test failed, sizes do not match"
  fi
else
  echo "Test failed, output files were not created"
fi

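#define _XOPEN_SOURCE 600	/* exposes posix_fallocate() in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char buf[8192] = {0};

int main(void)
{
	int i;
	int rc;
	ssize_t written;
	/* O_EXCL makes open() fail if a file is left over from an earlier run */
	int fda = open("/tmp/afile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	int fdb = open("/tmp/bfile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fda < 0 || fdb < 0)
	{
		printf("Opening files failed\n");
		return 1;
	}

	/* File A: 2048 writes of 8KB each, 16MB total */
	for (i = 0; i < 2048; i++)
	{
		written = write(fda, buf, sizeof(buf));
		if (written < (ssize_t) sizeof(buf))
		{
			printf("Write to file failed\n");
			return 2;
		}
	}

	/* File B: one 16MB posix_fallocate; it returns an error number, not -1 */
	rc = posix_fallocate(fdb, 0, 16*1024*1024);
	if (rc != 0)
	{
		printf("posix_fallocate failed: %s\n", strerror(rc));
		return 3;
	}

	close(fda);
	close(fdb);

	return 0;
}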

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
