On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi <horikyota....@gmail.com> wrote: > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively > used each other, but MarkBufferDirtyHint which delays checkpoint start > is called in RelationTruncate while delaying checkpoint completion. > That is not a strange nor harmful behavior. I changed delayChkpt to a > bitmap integer from an enum so that both barrier are separately > triggered. > > I'm not sure this is the way to go here, though. This fixes the issue > of a crash during RelationTruncate, but the issue of smgrtruncate > failure during RelationTruncate still remains (unless we treat that > failure as PANIC?).
I like this patch. As I understand it, we're currently cheating by allowing checkpoints to complete without necessarily flushing all of the pages that were dirty at the time we fixed the redo pointer out to disk. We think this is OK because we know that those pages are going to get truncated away, but it's not really OK because when the system starts up, it has to replay WAL starting from the checkpoint's redo pointer, but the state of the page is not the same as it was at the time when the redo pointer was the end of WAL, so redo fails. In the case described in http://postgr.es/m/byapr06mb63739b2692dc6dbb3c5f186cab...@byapr06mb6373.namprd06.prod.outlook.com modifications are made to the page before the redo pointer is fixed and those changes never make it to disk, but the truncation also never makes it to the disk either. With this patch, that can't happen, because no checkpoint can intervene between when we (1) decide we're not going to bother writing those dirty pages and (2) actually truncate them away. So either the pages will get written as part of the checkpoint, or else they'll be gone before the checkpoint completes. In the latter case, I suppose redo that would have modified those pages will just be skipped, thus dodging the problem. In RelationTruncate, I suggest that we ought to clear the delay-checkpoint flag before rather than after calling FreeSpaceMapVacuumRange. Since the free space map is not fully WAL-logged, anything we're doing there should be non-critical. Also, I think it might be better if MarkBufferDirtyHint stays closer to the existing coding and just uses a Boolean and an if-test to decide whether to clear the bit, instead of inventing a new mechanism. I don't really see anything wrong with the new mechanism, but I think it's better to keep the patch minimal. As you say, this doesn't fix the problem that truncation might fail. But as Andres and Sawada-san said, the solution to that is to get rid of the comments saying that it's OK for truncation to fail and make it a PANIC. However, I don't think that change needs to be part of this patch. Even if we do that, we still need to do this. And even if we do this, we still need to do that. -- Robert Haas EDB: http://www.enterprisedb.com