Jeff Janes wrote:

> 12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
> 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
>
> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.
I think I see the problem here now, after letting this test rig run for
a while.

First, the fact that there are holes in members/ files because of the
random deletion order seems harmless in itself, because the files
remaining in between will be deleted by a future vacuum.

Now, the real problem is that we delete files during vacuum, but the
state that marks those files as safely deletable is written as part of a
checkpoint record, not by vacuum itself (vacuum writes its state to
pg_database, but a checkpoint derives its info from a shared memory
variable). Taken together, this means that if there's a crash between
the vacuum that does a deletion and the next checkpoint, we might
attempt to read an offset file that is not supposed to be part of the
live range -- but we forgot that, because we didn't reach the point
where we save the shmem state to disk.

It seems to me that we need to keep the offsets files around until a
checkpoint has written the "oldest" number to WAL. In other words, we
need additional state in shared memory: (a) what we currently store,
which is the oldest number as computed by vacuum (not safe to delete,
but the number that the next checkpoint must write), and (b) the oldest
number that the last checkpoint wrote (the safe deletion point).

Another thing I noticed is that more than one vacuum process can try to
run deletion simultaneously, at least if they crash frequently while
doing deletion. I don't see that this is troublesome, even though they
might attempt to delete the same files.

Finally, I noticed that we first read the oldest offset file, then
determine the member files to delete; then we delete offset files, then
member files. This means that, if there is a crash in the middle of
deletion, we could have deleted offset files corresponding to member
files that we keep. It seems to me we should delete members first, then
offsets, if we wanted to be very safe about it.
I don't think this would really matter much, if we were to do things
safely as described above.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers