On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Hi, > > On 01/22/2016 06:45 AM, Michael Paquier wrote: > > So, I have been playing with a Linux VM with VMware Fusion and on >> ext4 with data=ordered the renames are getting lost if the root >> folder is not fsync. By killing-9 the VM I am able to reproduce that >> really easily. >> > > Yep. Same experience here (with qemu-kvm VMs). > > Here are some comments about your patch after a look at the code. >> >> Regarding the additions in fsync_fname() in xlog.c: >> 1) In InstallXLogFileSegment, rename() will be called only if >> HAVE_WORKING_LINK is not used, which happens only on Windows and >> cygwin. We could add it for consistency, but it should be within the >> #else/#endif block. It is not critical as of now. >> 2) The call in RemoveXlogFile is not necessary, the rename happening >> only on Windows. >> > > Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could > there be some other platforms? And are we sure the file systems on those > platforms are safe without the fsync call? > > That is, while the report references ext4, there may be other file systems > with the same problem - ext4 was used mostly as it's the most widely used > Linux file system. > > 3) In exitArchiveRecovery if the rename is not made durable I think >> it does not matter much. Even if recovery.conf is the one present >> once the node restarts node would move back again to recovery, and >> actually we had better move back to recovery in this case, no? >> > > I'm strongly against this "optimization" - I'm more than happy to exchange > the one fsync for not having to manually fix the database after crash. > > I don't really see why switching back to recovery should be desirable in > this case? Imagine you have a warm/hot standby, and that you promote it to > master. The clients connect, start issuing commands and then the system > crashes and loses the recovery.conf rename. The system reboots, database > performs local recovery but then starts as a standby and starts rejecting > writes. That seems really weird to me. > > 4) StartupXLOG for the tablespace map. I don't think this one is >> needed as well. Even if the tablespace map is not removed after a >> power loss user would get an error telling that the file should be >> removed. >> > > Please no, for the same reasons as in (3). > > 5) For the one where haveBackupLabel || haveTblspcMap. If we do the >> fsync, we guarantee that there is no need to do again the recovery. >> But in case of a power loss, isn't it better to do the recovery again? >> > > Why would it be better? Why should we do something twice when we don't > have to? Had this not be reliable, then the whole recovery process is > fundamentally broken and we better fix it instead of merely putting a > band-aid on it. > > 6) For the one after XLogArchiveNotify() for the last partial >> segment of the old timeline, it doesn't really matter to not make the >> change persistent as this is mainly done because it is useful to >> identify that it is a partial segment. >> > > OK, although I still don't quite see why that should be a reason not to do > the fsync. It's not really going to give us any measurable performance > advantage (how often we do those fsyncs), so I'd vote to keep it and make > sure the partial segments are named accordingly. Less confusion is always > better. > > 7) In CancelBackup, this one is not needed as well I think. We would >> surely want to get back to recovery if those files remain after a >> power loss. >> > > I may be missing something, but why would we switch to recovery in this > case? > > >> For the ones in xlogarchive.c: >> 1) For KeepFileRestoredFromArchive, it does not matter here, we are >> renaming a file with a temporary name to a permanent name. Once the >> node restarts, we would see an extra temporary file if the rename >> was not effective. >> > > So we'll lose the segment (won't have it locally under the permanent > name), as we've already restored it and won't do that again. Is that really > a good thing to do? > > 2) XLogArchiveForceDone, the only bad thing that would happen here is >> to leave behind a .ready file instead of a .done file. I guess that we >> could have it though as an optimization to not have to archive again >> this file. >> > > Yes. > > >> For the one in pgarch.c: >> 1) In pgarch_archiveDone, we could add it as an optimization to >> actually let the server know that the segment has been already >> archived, preventing a retry. >> > > Not sure what you mean by "could add it as an optimization"? > > In timeline.c: >> 1) writeTimeLineHistoryFile, it does not matter much. In the worst >> case we would have just a dummy temporary file, and startup process >> would come back here (in exitArchiveRecovery() we may finish with an >> unnamed backup file similarly). >> > > OK > > 2) In writeTimeLineHistoryFile, similarly we don't need to care >> much, in WalRcvFetchTimeLineHistoryFiles recovery would take again >> the same path >> > > OK > > >> Thoughts? >> >> > Thanks for the review and comments. I think the question is whether we > only want to do the additional fsync() only when it ultimately may lead to > data loss, or even in cases where it may cause operational issues (e.g. > switching back to recovery needlessly). > > I'd vote for the latter, as I think it makes the database easier to > operate (less manual interventions) and the performance impact is 0 (as > those fsyncs are really rare). > Yeah, unless it gives a significant performance penalty, I'd agree that the latter seems like the better option of those. +1 for that way :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/