Despite its efforts, CreateLockFile() does not reliably prevent multiple closely-timed postmasters from completing startup on the same data directory. This test procedure evades its defenses:
1. kill -9 the running postmaster for $PGDATA 2. "mkdir /tmp/sock1 /tmp/sock2" 3. "gdb postgres", "b unlink", "run -k /tmp/sock1". Hits breakpoint. 4. "postgres -k /tmp/sock2" 5. In gdb: "d 1", "c". The use of two socket directories is probably not essential, but it makes the test case simpler. The problem here is a race between concluding the assessment of a PID file as defunct and unlinking it; during that period, another postmaster may have replaced the PID file and proceeded. As far as I've been able to figure, this flaw is fundamental to any PID file invalidation algorithm relying solely on atomic filesystem operations like unlink(2), link(2), rename(2) and small write(2) for mutual exclusion. Do any of you see a way to remove the race? I think we should instead implement postmaster mutual exclusion by way of fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows. PostgreSQL used fcntl(F_SETLK) on its Unix socket a decade ago, and that usage was removed[1] due to portability problems[2][3]. POSIX does not require it to work on other than regular files, but I can't find evidence of a Unix-like platform not supporting it for regular files. Perl's metaconfig does test it, with a note about VMS and DOS/DJGPP lacking support. Gnulib reports only the lack of Windows support, though the Gnulib test suite does not cover F_SETLK. CreateLockFile() would have this algorithm: 1. open("postmaster.pid", O_RDWR | O_CREAT, 0600) 2. Request a fcntl write lock on the entire postmaster.pid file. If this fails, abandon startup: another postmaster is running. 3. Read the PID from the file. If the read returns 0 bytes, either we created the file or the postmaster creating it crashed before writing and syncing any content. Such a postmaster would not have reached the point of allocating shared memory or forking children, so we're safe to proceed in either case. Any other content problems should never happen and can remain fatal errors. 4. Check any old PID as we do today. In theory, this should always be redundant with the fcntl lock. However, since the PID check code is mature, let's keep it around indefinitely as a backstop. Perhaps adjust the error message to better reflect its "can't happen" nature, though. 5. Read the shared memory key from the file. If we find one, probe it as we do today. Otherwise, the PID file's creating postmaster crashed before writing and syncing that value. Such a postmaster would not have reached the point of forking children, so we're safe to proceed. 6. ftruncate() the file and write our own content. 7. Set FD_CLOEXEC on the file, leaving it open to retain the lock. AddToDataDirLockFile() would use the retained file descriptor. On a related but independent note, I think the ereport(LOG) calls in that function should be ereport(FATAL) as they are in CreateLockFile(). We're not significantly further into the startup process, and failing to add the shared memory key to the file creates a hazardous situation. The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse() check does not apply here, because the postmaster itself does not run arbitrary code that might reopen postmaster.pid. This change adds some interlock to data directories shared over NFS. It also closes the occasionally-reported[5][6] problem that an empty postmaster.pid blocks cluster startup; we no longer face the possibility that an empty file is the just-created product of a concurrent postmaster. It's probably no longer necessary to fsync postmaster.pid, though I'm disinclined to change that straightway. Do any of you see holes in that design or have better ideas? Thanks, nm [1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=792b0f4666b6ea6346aa8d29b568e5d3fe1fcef5 [2] http://archives.postgresql.org/pgsql-hackers/2000-07/msg00359.php [3] http://archives.postgresql.org/pgsql-hackers/2000-11/msg01306.php [4] http://archives.postgresql.org/pgsql-hackers/2012-06/msg01598.php [5] http://archives.postgresql.org/pgsql-hackers/2010-08/msg01094.php [6] http://archives.postgresql.org/pgsql-hackers/2012-01/msg00233.php -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers