On Thu, 2009-06-04 at 10:02 -0400, Tom Lane wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
> > What seems strange about the various errors generated in bufpage.c is
> > that they are marked as ERRORs, yet are executed within a critical
> > section causing the system to PANIC.
>
> The reason we PANIC there is to reduce the probability that bad data
> will be written back to disk. Of course, if the bad data was read off
> disk in the first place, there's no hope --- but we have checks on
> incoming pages for that.

We don't have checks for this on incoming pages: we only ever check the
header. I think there is hope if we do have corruption. We don't need to
PANIC; we can do what I've suggested instead.

> What seems significantly more likely if we
> detect a problem here is that we somehow corrupted the page while it
> sits in shared buffers. So, there's some hope that the corruption will
> not get back to disk, so long as we PANIC and thereby cause
> shared-memory contents to be flushed.

If the block is marked as dirty, then yes, I can see your point. I would
rather PANIC than lose data.

If the block is *not* dirty, i.e. it has been trashed in some other way,
then it is not likely to go to disk. Anybody re-reading the block will
see the same corruption and die long before they can make the page
dirty. So a corrupt, yet clean block is no reason to PANIC. If the block
*is* dirty, it might only be because of a hint bit change, and there are
various other ways to dirty a block that don't trigger full page
validity checks.

> > Votes?
>
> +1 for no change.
>
> We could make the page-read-time validation checks stricter, if there's
> some specific pattern you're seeing that gets past those checks now.

Don't know what the pattern is because the bloody things keep PANICing.
Seriously, it does look like some kind of memory corruption issue, but
I'm still not sure whether it is hardware or software related. But this
thread is not about that issue; it's about how we respond in the face of
such issues.

The main problem is that there is no easy way to get rid of the corrupt
block. You have to select out the good data somehow, then truncate. I
like Alvaro's suggestion for the future, but mostly we need to be able
to surgically remove the data block. Yes, I can do this without backend
changes via normal levels of non-user level skullduggery.

It would be good to have a check_page_validity option so that before we
do anything to a page we do full validation, especially, for example, on
PageAddItem().
With that turned on, I wouldn't mind PANICing, because at least you'd
have reasonable proof that the corruption has happened recently and that
PANICing may actually prevent the horse from escaping.

Thanks,

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
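
To make the check_page_validity idea a bit more concrete, here is a rough
sketch of the difference between a header-only check and a full check,
and of responding differently to clean and dirty blocks. It uses a toy
page layout, not the real bufpage.h structures: every Toy*/toy_* name,
the field layout, and the constants are assumptions made up for
illustration, and the clean-versus-dirty policy is just one reading of
the argument above, not anything the backend does today.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE   8192  /* assumed block size */
#define TOY_HEADER_SIZE 8     /* assumed size of the toy header */
#define TOY_MAX_ITEMS   64    /* arbitrary cap for this sketch */

/* Hypothetical line pointer: where one item lives on the page. */
typedef struct ToyItemId {
    uint16_t off;  /* byte offset of the item within the page */
    uint16_t len;  /* item length in bytes; 0 means "unused" */
} ToyItemId;

/* Hypothetical page model; NOT PostgreSQL's PageHeaderData. */
typedef struct ToyPage {
    uint16_t  lower;    /* end of the line pointer array */
    uint16_t  upper;    /* start of item data */
    uint16_t  special;  /* start of special space */
    uint16_t  nitems;   /* number of line pointers in use */
    ToyItemId items[TOY_MAX_ITEMS];
} ToyPage;

typedef enum { TOY_OK, TOY_ERROR, TOY_PANIC } ToyAction;

/* Cheap check: header offsets only, in increasing order within the page. */
bool toy_header_is_valid(const ToyPage *p)
{
    return p->lower >= TOY_HEADER_SIZE &&
           p->lower <= p->upper &&
           p->upper <= p->special &&
           p->special <= TOY_PAGE_SIZE;
}

/* Full check: header plus every line pointer must stay inside the data area. */
bool toy_page_is_valid(const ToyPage *p)
{
    if (!toy_header_is_valid(p) || p->nitems > TOY_MAX_ITEMS)
        return false;

    for (uint16_t i = 0; i < p->nitems; i++) {
        const ToyItemId *id = &p->items[i];

        if (id->len == 0)
            continue;  /* unused slot, nothing to verify */

        /* Operands promote to int, so off + len cannot wrap around. */
        if (id->off < p->upper || id->off + id->len > p->special)
            return false;
    }
    return true;
}

/*
 * Assumed policy: a corrupt but clean block cannot reach disk, so a plain
 * ERROR is enough; a corrupt dirty block might, so escalate to PANIC.
 */
ToyAction toy_corruption_response(const ToyPage *p, bool is_dirty)
{
    if (toy_page_is_valid(p))
        return TOY_OK;
    return is_dirty ? TOY_PANIC : TOY_ERROR;
}
```

A page whose header offsets are sane but which has one line pointer
running past the special space passes toy_header_is_valid() and fails
toy_page_is_valid(). That is the kind of damage a header-only check on
incoming pages would miss. The sketch also ignores the hint-bit case, so
"dirty" here is a cruder test than one would really want.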