On 04/06/2018 07:22 PM, Andres Freund wrote: > Hi, > > On 2018-04-06 14:34:43 +0200, Tomas Vondra wrote: >>> Oh, that's not my intention either -- I just wanted to make sure I >>> was thinking about the same issue you were. > >> I agree we shouldn't rely on chance here - if we might read a stale >> value, we need to fix that of course. > > It's perfectly possible that some side-conditions mitigate this. What > concerns me that > a) Nobody appears to have raised this issue beforehand, besides an > unlocked read of a critical variable being a fairly obvious > issue. This kind of thing needs to be carefully thought about. > b) If there's some "side channel" interlock, it's not documented. > > I noticed the issue because of an IM question about the general feature, > and I did a three minute skim and saw the read without a comment. >
All I can say is that I did consider this issue while reviewing the patch, and I've managed to convince myself it's not an issue (using the logic that I've just outlined here). Which is why I haven't raised it as an issue, because I don't think it is. You're right it might have been mentioned explicitly, of course. In any case, I wouldn't call LockBufHdr/UnlockBufHdr a "side channel" interlock. It's a pretty direct and intentional interlock, I think. > >> I'm not quite sure I fully understand the issue, though. I assume both >> LockBufHdr and UnlockBufHdr are memory barriers, so for bad things to >> happen the process would need to be already past LockBufHdr when the >> checksum version is updated. In which case it can use a stale version >> when writing the buffer out. Correct? > > Yes, they're are memory barriers. > Phew! ;-) > >> I wonder if that's actually a problem, considering the checksum worker >> will then overwrite all data with correct checksums anyway. So the other >> process would have to overwrite the buffer after checksum worker, at >> which point it'll have to go through LockBufHdr. > > Again, I'm not sure if there's some combination of issues that make this > not a problem in practice. I just *asked* if there's something > preventing this from being a problem. > > The really problematic case would be if it is possible for some process > to wait long enough, without executing a barrier implying operation, > that it'd try to write out a page that the checksum worker has already > passed over. > Sure. But what would that be? I can't think of anything. A process that modifies a buffer (or any other piece of shared state) without holding some sort of lock seems broken by default. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services