On Oct 29, 2019, at 18:50, Kyotaro Horiguchi <horikyota....@gmail.com <mailto:horikyota....@gmail.com>> wrote: > > Hello. > > At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" > <lingce....@alibaba-inc.com <mailto:lingce....@alibaba-inc.com>> wrote in >> >> Hi, >> >> I recently discovered two possible bugs about synchronous replication. >> >> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted >> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it >> is not detached, >> acquires the SyncRepLock lock and deletes it. If this element has been >> deleted by walsender, >> it will be deleted repeatedly, SHMQueueDelete will core with a segment >> fault. >> >> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then >> check >> whether the queue is detached or not. > > I think you're right here.
Thanks. > >> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one >> interrupt. >> For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just >> terminate the wait >> with suitable warning. As follows: >> >> a. set QueryCancelPending to false >> b. errport outputs one warning >> c. calls SyncRepCancelWait to delete one element from the queue >> >> If another cancel interrupt arrives when we are outputting warning at step >> b, the errfinish >> will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling >> autovacuum >> task", then the process will jump to the sigsetjmp. Unfortunately, the step >> c will be skipped >> and the element that should be deleted by SyncRepCancelWait is remained. >> >> The easiest way to fix this is to swap the order of step b and step c. On >> the other hand, >> let sigsetjmp clean up the queue may also be a good choice. What do you >> think? >> >> Attached the patch, any feedback is greatly appreciated. > > This is not right. It is in transaction commit so it is in a > HOLD_INTERRUPTS section. ProcessInterrupt does not respond to > cancel/die interrupts thus the ereport should return. I read the relevant code, you are right. I called SyncRepWaitForLSN somewhere else, but forgot to put it in a HOLD_INTERRUPTS and triggered an exception. regards. — Dongming Liu
smime.p7s
Description: S/MIME cryptographic signature