On 04/06/2018 07:22 PM, Andres Freund wrote:
> Hi,
> On 2018-04-06 14:34:43 +0200, Tomas Vondra wrote:
>>> Oh, that's not my intention either -- I just wanted to make sure I
>>> was thinking about the same issue you were.
>> I agree we shouldn't rely on chance here - if we might read a stale
>> value, we need to fix that of course.
> It's perfectly possible that some side-conditions mitigate this. What
> concerns me that
> a) Nobody appears to have raised this issue beforehand, besides an
>    unlocked read of a critical variable being a fairly obvious
>    issue. This kind of thing needs to be carefully thought about.
> b) If there's some "side channel" interlock, it's not documented.
> I noticed the issue because of an IM question about the general feature,
> and I did a three minute skim and saw the read without a comment.

All I can say is that I did consider this issue while reviewing the
patch, and I've managed to convince myself it's not an issue (using the
logic that I've just outlined here). Which is why I haven't raised it as
an issue, because I don't think it is.

You're right it might have been mentioned explicitly, of course.

In any case, I wouldn't call LockBufHdr/UnlockBufHdr a "side channel"
interlock. It's a pretty direct and intentional interlock, I think.

>> I'm not quite sure I fully understand the issue, though. I assume both
>> LockBufHdr and UnlockBufHdr are memory barriers, so for bad things to
>> happen the process would need to be already past LockBufHdr when the
>> checksum version is updated. In which case it can use a stale version
>> when writing the buffer out. Correct?
> Yes, they're are memory barriers.

Phew! ;-)

>> I wonder if that's actually a problem, considering the checksum worker
>> will then overwrite all data with correct checksums anyway. So the other
>> process would have to overwrite the buffer after checksum worker, at
>> which point it'll have to go through LockBufHdr.
> Again, I'm not sure if there's some combination of issues that make this
> not a problem in practice. I just *asked* if there's something
> preventing this from being a problem.
> The really problematic case would be if it is possible for some process
> to wait long enough, without executing a barrier implying operation,
> that it'd try to write out a page that the checksum worker has already
> passed over.

Sure. But what would that be? I can't think of anything. A process that
modifies a buffer (or any other piece of shared state) without holding
some sort of lock seems broken by default.


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply via email to