> On Oct 23, 2025, at 18:02, Arseniy Mukhin <[email protected]> 
> wrote:
> 
> Hi,
> 
> On Thu, Oct 23, 2025 at 11:17 AM Chao Li <[email protected]> wrote:
>> 
>> 
>> 
>>> On Oct 21, 2025, at 00:43, Arseniy Mukhin <[email protected]> 
>>> wrote:
>>> 
>>> 
>>> I managed to reproduce the race with v20-alt3. I tried to write a TAP
>>> test reproducing the issue, so it was easier to validate changes.
>>> Please find the attached TAP test. I added it to some random package
>>> for simplicity.
>>> 
>> 
>> With alt3, as we have acquired the notification lock after reading every 
>> message to update the POS, I think we can do a little bit more optimization:
>> 
>> The notifier: in SignalBackend()
>>    * Now we check if a listener’s pos equals to beforeWritePos, then we do 
>> “directly advancement”
>>    * We can change to if a listener’s pos is between beforeWritePos and 
>> afterWritePos, then we can do the advancement.
>> 
>> The listener: in asyncQueueReadAllNotifications():
>>    * With alt3, we only lock and update pos
>>    * We can do more. If current pos in shared memory is after that local 
>> pos, then meaning some notifier has done an advancement, so it can stop 
>> reading.
>> 
> 
> I think this would be a reasonable optimization if it weren't for the
> race condition mentioned above. The problem is that if the local pos
> lags behind the shared memory pos, it could point to a truncated queue
> segment, so we shouldn't allow that.
> 

I figured out a way to resolve the race condition for alt3:

* add an awakening flag for every listener, this flag is only set by listeners
* add an advisory pos for every listener, similar to alt1
* if a listener is awaken, notify only updates the listener’s advisory pos; 
otherwise directly advance its position.
* If a running listener see current pos is behind advisory pos, then stop 
reading

See more details in attach patch file, I added code comments for my changes. 
Now the TAP test won’t hit the race condition. 
```
# +++ tap check in src/test/authentication +++
t/008_listen-pos-race.pl .. skipped: Injection points not supported by this 
build
Files=1, Tests=0,  0 wallclock secs ( 0.00 usr  0.00 sys +  0.03 cusr  0.01 
csys =  0.04 CPU)
Result: NOTESTS
```

And with my solution, listeners longer will still use local pos, so that no 
longer need to acquire notification lock in every loop.

The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch. Please 
see if my solution works.

I also made a tiny change in the TAP script to allow it to terminate gracefully.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachment: fix-race.patch
Description: Binary data




Reply via email to