On Oct 29, 2019, at 18:50, Kyotaro Horiguchi <horikyota....@gmail.com 
<mailto:horikyota....@gmail.com>> wrote:
> 
> Hello.
> 
> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" 
> <lingce....@alibaba-inc.com <mailto:lingce....@alibaba-inc.com>> wrote in 
>> 
>> Hi,
>> 
>> I recently discovered two possible bugs about synchronous replication.
>> 
>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it 
>> is not detached, 
>> acquires the SyncRepLock lock and deletes it. If this element has been 
>> deleted by walsender, 
>> it will be deleted repeatedly, SHMQueueDelete will core with a segment 
>> fault. 
>> 
>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then 
>> check
>> whether the queue is detached or not.
> 
> I think you're right here.

Thanks.

> 
>> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one 
>> interrupt.
>> For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just 
>> terminate the wait 
>> with suitable warning. As follows:
>> 
>> a. set QueryCancelPending to false
>> b. errport outputs one warning
>> c. calls SyncRepCancelWait to delete one element from the queue
>> 
>> If another cancel interrupt arrives when we are outputting warning at step 
>> b, the errfinish
>> will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling 
>> autovacuum
>> task", then the process will jump to the sigsetjmp. Unfortunately, the step 
>> c will be skipped
>> and the element that should be deleted by SyncRepCancelWait is remained.
>> 
>> The easiest way to fix this is to swap the order of step b and step c. On 
>> the other hand, 
>> let sigsetjmp clean up the queue may also be a good choice. What do you 
>> think?
>> 
>> Attached the patch, any feedback is greatly appreciated.
> 
> This is not right. It is in transaction commit so it is in a
> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
> cancel/die interrupts thus the ereport should return.

I read the relevant code, you are right. I called SyncRepWaitForLSN somewhere 
else, 
but forgot to put it in a HOLD_INTERRUPTS and  triggered an exception.

regards.

—
Dongming Liu

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to