On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
> Our customer hit another bug of pg_rewind with PG 9.5.  The attached
> patch fixes this.
Sorry for my late reply.  It unfortunately took me some time to come
back to this patch.

> PROBLEM
> ========================================
>
> After a long run of successful pg_rewind, the synchronized standby
> could not catch up with the primary forever, emitting the following
> message repeatedly:
>
> LOG:  XX000: could not read from log segment 000000060000028A00000031,
> offset 16384: No error

Oops.

> CAUSE
> ========================================
>
> If the primary removes WAL files that pg_rewind is going to get,
> pg_rewind leaves 0-byte WAL files in the target directory here:
>
> [libpq_fetch.c]
> 		case FILE_ACTION_COPY:
> 			/* Truncate the old file out of the way, if any */
> 			open_target_file(entry->path, true);
> 			fetch_file_range(entry->path, 0, entry->newsize);
> 			break;
>
> pg_rewind completes successfully, creates recovery.conf, and then the
> standby is started in the target cluster.  walreceiver receives WAL
> records and writes them to the 0-byte WAL files.  Finally, the xlog
> reader complains that it cannot read a WAL page.

This part qualifies as a data corruption bug, as some records are lost
by the backend.

> FIX
> ========================================
>
> pg_rewind deletes the file when it finds that the primary has deleted
> it.

The concept looks right to me.  I think that this does not apply only
to WAL segments though.  You could have a temporary WAL segment that
has been included in the chunk table, which is even more volatile, or
a multixact file, a snapshot file, etc.  In short, anything which is
not a relation file and which could disappear between the moment the
file map is built and the moment the data is fetched from the source
server.

@@ -174,7 +173,7 @@ remove_target_file(const char *path)
 		return;

 	snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
-	if (unlink(dstpath) != 0)
+	if (unlink(dstpath) != 0 && !ignore_error)

I see.  So the only reason why this flag exists is that if a file is
large enough to be split into multiple chunks, then the first unlink()
will work, but not the successive ones.  One chunk can be at most 1
million bytes, which is why this is essential for WAL segments.
Instead of ignoring *all* errors, let's ignore only ENOENT, and rename
ignore_error to missing_ok as well (a sketch of what this could look
like follows at the end of this message).

You need to update the comment in receiveFileChunks where an entry
gets deleted with basically what I mentioned above, say:

"If a file has been deleted on the source, remove it on the target as
well.  Note that multiple unlink() calls may happen on the same file
if multiple data chunks are associated with it, hence ignore
unconditionally anything missing.  If this file is not a relation data
file, then it has been already truncated when creating the file chunk
list at the previous execution of the filemap."

It would also be nice to add a comment on top of remove_target_file to
explain what missing_ok does, so as to keep track of what the code
should do.

The portion about whether data from pg_wal should be transferred or
not is really an improvement that could come in a future version, and
what you are proposing here is more general, so I think that we should
go ahead with it.
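To make the ENOENT-only suggestion concrete, here is a minimal sketch
of how remove_target_file() could look with ignore_error renamed to
missing_ok.  This is only an illustration against the pg_rewind
sources quoted above, not a final patch; pg_fatal(), dry_run and
datadir_target are the existing pg_rewind helpers and globals, and the
includes and fallback MAXPGPATH are just there to keep the snippet
self-contained.

/*
 * Sketch only, not a final patch: remove_target_file() with
 * ignore_error renamed to missing_ok and restricted to ENOENT.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

extern const char *datadir_target;	/* pg_rewind target data directory */
extern bool dry_run;			/* pg_rewind --dry-run flag */
extern void pg_fatal(const char *fmt, ...);

#ifndef MAXPGPATH
#define MAXPGPATH 1024
#endif

/*
 * Remove a file from the target data directory.  If missing_ok is
 * true, ignore ENOENT: a file larger than the maximum chunk size of
 * 1 million bytes has several chunks pointing at it, so unlink() can
 * legitimately run more than once on the same path, and the file may
 * also have vanished from the source after the filemap was built.
 */
void
remove_target_file(const char *path, bool missing_ok)
{
	char		dstpath[MAXPGPATH];

	if (dry_run)
		return;

	snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
	if (unlink(dstpath) != 0)
	{
		if (errno == ENOENT && missing_ok)
			return;

		pg_fatal("could not remove file \"%s\": %s\n",
				 dstpath, strerror(errno));
	}
}

The call in receiveFileChunks for a NULL chunk would then pass
missing_ok = true and carry the comment suggested above, while callers
that expect the file to still be present would pass false.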
--
Michael