It could also affect other types of procedures, but I think the
probability is very low.

You need to send two requests for the same table/peer at the same time,
and the two procedures must both be polled out and executed by
different PEWorkers. When acquiring the lock, one of the procedures
loses the race and is put into the waiting queue of the LockAndQueue
instance for this table/peer. The winning procedure must then schedule
a sub procedure of the same procedure type; when that sub procedure
finishes, it cleans up the LockAndQueue instance, so the procedure in
the waiting queue can never be executed.
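
To make the sequence concrete, here is a toy model of it. This is only
a sketch under my own assumptions, not the real scheduler code: the
LockState class, the locks map and the tryLock/completeBuggy/wakeNext
helpers are all hypothetical names that stand in for the per-table/peer
LockAndQueue bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class LostProcedureSketch {
    // Hypothetical stand-in for the per-table/peer lock state.
    // This is NOT HBase's real LockAndQueue class.
    static class LockState {
        String exclusiveOwner;
        final Queue<String> waiting = new ArrayDeque<>();
    }

    // One entry per table/peer id, created lazily (in-memory only).
    static final Map<String, LockState> locks = new HashMap<>();

    // The first caller gets the lock; later callers are queued as waiters.
    static boolean tryLock(String key, String proc) {
        LockState s = locks.computeIfAbsent(key, k -> new LockState());
        if (s.exclusiveOwner == null) {
            s.exclusiveOwner = proc;
            return true;
        }
        s.waiting.add(proc);
        return false;
    }

    // Assumed buggy cleanup: a finishing procedure drops the whole entry
    // without checking whether anybody is still waiting on it.
    static void completeBuggy(String key) {
        locks.remove(key);
    }

    // Called when the lock holder finishes; returns the next waiter, if any.
    static String wakeNext(String key) {
        LockState s = locks.get(key);
        return s == null ? null : s.waiting.poll();
    }

    static String simulate() {
        locks.clear();
        String peer = "peer-1";
        tryLock(peer, "A");     // A wins the lock
        tryLock(peer, "B");     // B loses and goes to the waiting queue
        completeBuggy(peer);    // A's same-type sub procedure finishes
        return wakeNext(peer);  // A finishes, but the queue holding B is gone
    }

    public static void main(String[] args) {
        System.out.println("woken after parent finishes: " + simulate());
    }
}
```

In this model simulate() returns null: B was queued, but the entry
that held it was removed before the parent tried to wake it up.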

Most peer related procedures hold the exclusive lock for their whole
execution time, i.e., holdLock = true, and schedule a
RefreshPeerProcedure, which is of the same type, so they are more
likely to hit this problem.

And yes, restarting the master fixes the problem: all of the above is
in-memory state, so after a restart the 'lost' procedure will be
scheduled again.

Thanks.

Junegunn Choi <junegun...@gmail.com> wrote on Mon, Jun 9, 2025 at 21:40:
>
> Just my two cents.
>
> I'd consider two questions:
>
> * Does it only occur when attempting to remove multiple peer configurations
> within a short time frame? If so, I think the likelihood of hitting it in
> practice is quite low.
> * Is the system recoverable without service disruption? (e.g., by
> restarting the master server)
>
> If the answer to both is yes, then I'd say we can lower the priority.
>
> That said, as I was writing this email, I just realized you already
> submitted a fix.
> It's a solid refactoring, and the fix aligns well with your observation.
> Great job as always.
>
> Thanks,
> Junegunn
