This is per a report by an EnterpriseDB customer and a bunch of
off-list analysis by Kevin Grittner and Rahila Syed.

Suppose you have a large relation with OID 123456.  There are segment
files 123456, 123456.1, and 123456.2.  Due to some kind of operating
system malfeasance, 123456.1 disappears; you are officially in
trouble.

Now, a funny thing happens.  The next time you call mdnblocks() on
this relation, which will probably happen pretty quickly since every
sequential scan does one, it will see that 123456 is a complete
segment and it will create an empty 123456.1.  It and any future
mdnblocks() calls will report that the length of the relation is equal
to the length of one full segment, because they don't check for the
next segment unless the current segment is completely full.
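The behavior described above can be sketched with a toy model. This is not the md.c code: ToyRelation, toy_mdnblocks(), TOY_RELSEG_SIZE, and SEG_MISSING are hypothetical names invented for illustration, segments are just block counts in an array, and a segment holds 4 blocks rather than the real RELSEG_SIZE.

```c
#include <assert.h>

#define TOY_RELSEG_SIZE 4      /* blocks per segment; toy value (assumption) */
#define TOY_MAX_SEGS    8      /* capacity of the toy model (assumption) */
#define SEG_MISSING     (-1)   /* marks a segment file that does not exist */

/* Hypothetical stand-in for a relation's segment files. */
typedef struct ToyRelation
{
    int seg_blocks[TOY_MAX_SEGS];   /* block count per segment, or SEG_MISSING */
} ToyRelation;

/*
 * Toy model of the mdnblocks() behavior described above: advance to the
 * next segment only while the current one is completely full, and create
 * the next segment empty if it is missing (the O_CREAT side effect).
 */
int
toy_mdnblocks(ToyRelation *rel)
{
    int segno = 0;

    for (;;)
    {
        int nblocks = rel->seg_blocks[segno];

        /* The first segment must exist; later ones are created on demand. */
        assert(nblocks >= 0 && nblocks <= TOY_RELSEG_SIZE);

        if (nblocks < TOY_RELSEG_SIZE)
            return segno * TOY_RELSEG_SIZE + nblocks;

        segno++;
        assert(segno < TOY_MAX_SEGS);

        if (rel->seg_blocks[segno] == SEG_MISSING)
            rel->seg_blocks[segno] = 0;     /* created with zero length */
    }
}
```

With segments {full, missing, 2 blocks}, the first call creates an empty second segment and reports a length of one full segment; later calls report the same length and never look at the third segment.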

Now, if subsequent to this an index scan happens to sweep through and
try to fetch a block in 123456.2, it will work!  This happens because
_mdfd_getseg() doesn't care about the length of the segments; it only
cares whether or not they exist.  If 123456.1 were actually missing,
then we'd never test whether 123456.2 exists and we'd get an error.
But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
happy to fetch blocks in 123456.2; it considers the empty 123456.1
file to be a sufficient reason to look for 123456.2, and seeing that
file, and finding the block it wants therein, it happily returns that
block, blithely ignoring the fact that it passed over a completely empty .1
segment before returning a block from .2.  This is maybe not the
smartest thing ever.
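To make the lookup rule concrete, here is a minimal, self-contained sketch. It is a toy model and not the real _mdfd_getseg(): toy_getseg_can_read(), TOY_RELSEG_SIZE, and SEG_MISSING are hypothetical names, segments are represented only by their block counts, and a segment holds 4 blocks instead of the real RELSEG_SIZE.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RELSEG_SIZE 4      /* blocks per segment; toy value (assumption) */
#define TOY_MAX_SEGS    8      /* capacity of the toy model (assumption) */
#define SEG_MISSING     (-1)   /* marks a segment file that does not exist */

/*
 * Toy model of the lookup described above: to reach the segment holding
 * blkno, every segment up to and including it must exist, but the lengths
 * of the segments passed over are never checked.  Returns true if the
 * block would be readable under that rule.
 */
bool
toy_getseg_can_read(const int seg_blocks[TOY_MAX_SEGS], int blkno)
{
    int target = blkno / TOY_RELSEG_SIZE;

    assert(blkno >= 0 && target < TOY_MAX_SEGS);

    for (int segno = 0; segno <= target; segno++)
    {
        /* Only existence matters here, not how long the segment is. */
        if (seg_blocks[segno] == SEG_MISSING)
            return false;
    }

    /* The block itself must lie within the target segment. */
    return (blkno % TOY_RELSEG_SIZE) < seg_blocks[target];
}
```

In this model, a block in the third segment is unreachable while the second segment is missing, and becomes reachable as soon as a zero-length second segment exists, which is the inconsistency described above.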

The comment in mdnblocks() says this:

                         * Because we pass O_CREAT, we will create the next segment (with
                         * zero length) immediately, if the last segment is of length
                         * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
                         * the logic simple.

I don't really see how this "keeps the logic simple".  What it does is
allow sequential scans and index scans to have two completely
different notions of how many accessible blocks there are in the
relation.  Granted, this kind of thing really shouldn't happen, but
sometimes bad things do happen.  Therefore, I propose the attached
patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: RM36310.patch
Description: binary/octet-stream
