[HACKERS] CreateLockFile() race condition

Noah Misch Fri, 03 Aug 2012 07:57:12 -0700

Despite its efforts, CreateLockFile() does not reliably prevent multiple
closely-timed postmasters from completing startup on the same data directory.
This test procedure evades its defenses:

1. kill -9 the running postmaster for $PGDATA
2. "mkdir /tmp/sock1 /tmp/sock2"
3. "gdb postgres", "b unlink", "run -k /tmp/sock1". Hits breakpoint.
4. "postgres -k /tmp/sock2"
5. In gdb: "d 1", "c".

The use of two socket directories is probably not essential, but it makes the
test case simpler.

The problem here is a race between concluding the assessment of a PID file as
defunct and unlinking it; during that period, another postmaster may have
replaced the PID file and proceeded. As far as I've been able to figure, this
flaw is fundamental to any PID file invalidation algorithm relying solely on
atomic filesystem operations like unlink(2), link(2), rename(2) and small
write(2) for mutual exclusion. Do any of you see a way to remove the race?

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.
PostgreSQL used fcntl(F_SETLK) on its Unix socket a decade ago, and that usage
was removed[1] due to portability problems[2][3]. POSIX does not require it
to work on other than regular files, but I can't find evidence of a Unix-like
platform not supporting it for regular files. Perl's metaconfig does test it,
with a note about VMS and DOS/DJGPP lacking support. Gnulib reports only the
lack of Windows support, though the Gnulib test suite does not cover F_SETLK.

CreateLockFile() would have this algorithm:

1. open("postmaster.pid", O_RDWR | O_CREAT, 0600)
2. Request a fcntl write lock on the entire postmaster.pid file. If this
fails, abandon startup: another postmaster is running.
3. Read the PID from the file. If the read returns 0 bytes, either we created
the file or the postmaster creating it crashed before writing and syncing any
content. Such a postmaster would not have reached the point of allocating
shared memory or forking children, so we're safe to proceed in either case.
Any other content problems should never happen and can remain fatal errors.
4. Check any old PID as we do today. In theory, this should always be
redundant with the fcntl lock. However, since the PID check code is mature,
let's keep it around indefinitely as a backstop. Perhaps adjust the error
message to better reflect its "can't happen" nature, though.
5. Read the shared memory key from the file. If we find one, probe it as we
do today. Otherwise, the PID file's creating postmaster crashed before
writing and syncing that value. Such a postmaster would not have reached the
point of forking children, so we're safe to proceed.
6. ftruncate() the file and write our own content.
7. Set FD_CLOEXEC on the file, leaving it open to retain the lock.

AddToDataDirLockFile() would use the retained file descriptor. On a related
but independent note, I think the ereport(LOG) calls in that function should
be ereport(FATAL) as they are in CreateLockFile(). We're not significantly
further into the startup process, and failing to add the shared memory key to
the file creates a hazardous situation.

The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

This change adds some interlock to data directories shared over NFS. It also
closes the occasionally-reported[5][6] problem that an empty postmaster.pid
blocks cluster startup; we no longer face the possibility that an empty file
is the just-created product of a concurrent postmaster. It's probably no
longer necessary to fsync postmaster.pid, though I'm disinclined to change
that straightway.

Do any of you see holes in that design or have better ideas?

Thanks,
nm

[1]
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=792b0f4666b6ea6346aa8d29b568e5d3fe1fcef5
[2] http://archives.postgresql.org/pgsql-hackers/2000-07/msg00359.php
[3] http://archives.postgresql.org/pgsql-hackers/2000-11/msg01306.php
[4] http://archives.postgresql.org/pgsql-hackers/2012-06/msg01598.php
[5] http://archives.postgresql.org/pgsql-hackers/2010-08/msg01094.php
[6] http://archives.postgresql.org/pgsql-hackers/2012-01/msg00233.php

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] CreateLockFile() race condition

Reply via email to