Hello hackers, Let's try to get this issue resolved. Here is my position on the course of action we should take in back-branches:
1. I am -1 on back-patching the fd-transfer code. It's a significant change, and even when sufficiently debugged (I don't think it's there yet), we have no idea what will happen on all the kernels we support under extreme workloads. IMHO there is no way we can spring this on users in a point release. 2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing nothing is not an option I like. I have some feedback and changes to propose though; see attached. Responses to a review from Robert: On Thu, Jul 19, 2018 at 7:23 AM Robert Haas <robertmh...@gmail.com> wrote: > 2. I don't like promote_ioerr_to_panic() very much, partly because the > same pattern gets repeated over and over, and partly because it would > be awkwardly-named if we discovered that another 2 or 3 errors needed > similar handling (or some other variant handling). I suggest instead > having a function like report_critical_fsync_failure(char *path) that > does something like this: > > int elevel = ERROR; > if (errno == EIO) > elevel = PANIC; > ereport(elevel, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", path); > > And similarly I'd add report_critical_close_failure. In some cases, > this would remove wording variations (e.g. in twophase.c) but I think > that's fine, and maybe an improvement, as discussed on another recent > thread. I changed it to look like data_sync_elevel(ERROR) and made it treat all errnos the same. ENOSPC, EIO, EWOK, EIEIO, it makes no difference to the level of faith I have that my data still exists. > 3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong. Fixed. > 4. The comment changes in snapbuild.c interact with the TODO that > immediately follows. I think more adjustment is needed here. I don't understand this. > 5. It seems odd that you adjusted the comment for > pg_fsync_no_writethrough() but not pg_fsync_writethrough() or > pg_fsync(). Either pg_fsync_writethrough() doesn't have the same > problem, in which case, awesome, but let's add a comment, or it does, > in which case it should refer to the other one. And I think > pg_fsync() itself needs a comment saying that every caller must be > careful to use promote_ioerr_to_panic() or > report_critical_fsync_failure() or whatever we end up calling it > unless the fsync is not critical for data integrity. I removed these comments and many others; I don't see the point in scattering descriptions of this problem and references to specific versions of Linux and -hackers archive links all over the place. I added a comment in one place, and also added some user documentation of the problem. > 6. In md.c, there's a stray blank line added. But more importantly, > the code just above that looks like this: > > if (!FILE_POSSIBLY_DELETED(errno) || > failures > 0) > - ereport(ERROR, > + ereport(promote_ioerr_to_panic(ERROR), > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", > path))); > else > ereport(DEBUG1, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\" > but retrying: %m", > path))); > > I might be all wet here, but it seems like if we enter the bottom > branch, we still need the promote-to-panic behavior. That case only is only reached if FILE_POSSIBLY_DELETED() on the first time through the loop, and it detects an errno value not actually from fsync(). It's from FileSync(), when it tries to reopen a virtual fd and gets ENOENT, before calling fsync(). Code further down then absorbs incoming requests before checking if that was expected, closing a race. The comments could make that clearer, admittedly. > 7. The comment adjustment for SyncDataDirectory mentions an > "important" fact about fsync behavior, but then doesn't seem to change > any logic on that basis. I think in general a number of these > comments need a little more thought, but in this particular case, I > think we also need to consider what the behavior should be (and the > comment should reflect our considered judgement on that point, and the > implementation should match). I updated the comment. I don't think this is too relevant to the fsync() failure case, because we'll be rewriting all changes from the WAL again during recovery; I think this function is mostly useful for switching from fsync = off to fsync = on and restarting, not coping with previous fsync() failures by retrying (which we know to be useless anyway). Someone could argue that if you restarted after changing fsync from off to on, then this may be the first time you learn that write-back failed, and then you're somewhat screwed whether we panic or not, but I don't see any solution to that. Don't run databases with fsync = off. > 8. Andres suggested to me off-list that we should have a GUC to > disable the promote-to-panic behavior in case it turns out to be a > show-stopper for some user. I think that's probably a good idea. > Adding many new ways to PANIC in a minor release without providing any > way to go back to the old behavior sounds unfriendly. Surely, anyone > who suffers much from this has really serious other problems anyway, > but all the same I think we should provide an escape hatch. +1. See the new GUC data_sync_retry, defaulting to false. If set to true, we also need to fix the problem reported in [1], so here's the patch for that too. Other comments: I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a pass here. Inspection of some version of the kernel might tell us it can't advance the error counter and report failure, but what do we gain by relying on that? Changed. FD_DELETE_AT_CLOSE is not a good way to detect temporary files in recent versions, as it doesn't detect the kind of shared temporary files used by Parallel Hash; FD_TEMP_FILE_LIMIT is a better way. Changed. (We could also just not bother exempting temporary files?) I plan to continue working on the fd-transfer system as part of a larger sync queue redesign project for 12. If we can get an agreement that we can't possibly back-patch the fd-transfer logic, then we can move all future discussion of that topic over to the other thread[2], and this thread can be about consensus to back-patch the PANIC patch. Thoughts? [1] https://www.postgresql.org/message-id/flat/87y3i1ia4w....@news-spur.riddles.org.uk [2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmazxk0jykxw+b21f...@mail.gmail.com
0001-Don-t-forget-about-failed-fsync-requests-v4.patch
Description: Binary data
0002-PANIC-on-fsync-failure-v4.patch
Description: Binary data