On Thu, Apr 5, 2018 at 11:41 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2018-04-05 23:32:19 +0200, Magnus Hagander wrote:
> > On Thu, Apr 5, 2018 at 11:23 PM, Andres Freund <and...@anarazel.de> wrote:
> > > Is there any sort of locking that guarantees that worker processes see
> > > an up2date value of
> > > DataChecksumsNeedWrite()/ControlFile->data_checksum_version? Afaict
> > > there's not. So you can afaict end up with checksums being computed by
> > > the worker, but concurrent writes missing them. The window is going to
> > > be at most one missed checksum per process (as the unlocking of the page
> > > is a barrier) and is probably not easy to hit, but that's dangerous
> > > enough.
> >
> > So just to be clear of the case you're worried about. It's basically:
> > Session #1 - sets checksums to inprogress
> > Session #1 - starts dynamic background worker ("launcher")
> > Launcher reads and enumerates pg_database
> > Launcher starts worker in first database
> > Worker processes first block of data in database
> > And at this point, Session #2 has still not seen the "checksums inprogress"
> > flag and continues to write without checksums?
>
> Yes. I think there are some variations of that, but yes, that's pretty
> much it.
>
> > That seems like quite a long time to me -- is that really a problem?
>
> We don't generally build locking models that are only correct based on
> likelihood. Especially not without a lengthy comment explaining that
> analysis.

Oh, that's not my intention either -- I just wanted to make sure I was
thinking about the same issue you were. Since you know a lot more about that
type of interlock than I do :)

We already wait for all running transactions to finish before we start doing
anything. Obviously transactions != buffer writes (and we have things like
the checkpointer/bgwriter to consider). Is there something else that we could
safely just *wait* for? I have no problem whatsoever if this is a long wait
(given the total time).
I mean waiting at the level of "what if we just stick a sleep(10) in there".

Or can that somehow be cleanly solved using some of the new atomic operators?
Or is that likely to cause the same kind of overhead as throwing a barrier in
there?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/