It could also affect other types of procedures, but I think the probability is very low.
To hit it, you need to send two requests for the same table/peer at the same time; in particular, both procedures must be polled out and executed by different PEWorkers. Then, when acquiring the lock, one of the procedures loses the race and is put into the waiting queue of the LockAndQueue instance for this table/peer. Next, the winning procedure must schedule a sub procedure of the same procedure type, and when that sub procedure finishes, it cleans up the LockAndQueue instance, so the procedure in the waiting queue never gets executed.

Most peer-related procedures hold the exclusive lock for the whole execution time, i.e., holdLock = true, and schedule a RefreshPeerProcedure, which is of the same type, so they are more likely to hit this problem.

And yes, restarting the master fixes the problem: everything described above is in-memory state, so after a restart the 'lost' procedure will be scheduled again.

Thanks.

Junegunn Choi <junegun...@gmail.com> wrote on Mon, Jun 9, 2025 at 21:40:

> Just my two cents.
>
> I'd consider two questions:
>
> * Does it only occur when attempting to remove multiple peer configurations
>   within a short time frame? If so, I think the likelihood of hitting it in
>   practice is quite low.
> * Is the system recoverable without service disruption? (e.g., by
>   restarting the master server)
>
> If the answer to both is yes, then I'd say we can lower the priority.
>
> That said, as I was writing this email, I just realized you already
> submitted a fix.
> It's a solid refactoring, and the fix aligns well with your observation.
> Great job as always.
>
> Thanks,
> Junegunn
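
For illustration, the sequence described at the top of this message can be modeled with a toy Java sketch. Everything in it (the class LostProcedureSketch, ToyLockAndQueue, the guardedCleanup flag) is hypothetical and is not HBase's scheduler code. The guarded variant is only one possible way to avoid dropping waiters and is not necessarily how the submitted fix works.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, NOT the real HBase scheduler classes.
class LostProcedureSketch {

  // Stand-in for the per-table/peer lock state plus its waiting queue.
  static final class ToyLockAndQueue {
    String exclusiveOwner; // procedure currently holding the exclusive lock
    final Deque<String> waiting = new ArrayDeque<>();
  }

  private final Map<String, ToyLockAndQueue> queues = new HashMap<>();
  private final boolean guardedCleanup; // assumption: a flag to contrast both behaviors

  LostProcedureSketch(boolean guardedCleanup) {
    this.guardedCleanup = guardedCleanup;
  }

  // Returns true if proc got the lock; otherwise proc is parked in the waiting queue.
  boolean tryAcquire(String peerId, String proc) {
    ToyLockAndQueue q = queues.computeIfAbsent(peerId, k -> new ToyLockAndQueue());
    if (q.exclusiveOwner == null) {
      q.exclusiveOwner = proc;
      return true;
    }
    q.waiting.add(proc);
    return false;
  }

  // Called when the same-type sub procedure finishes. Without the guard, the
  // entry is dropped even though a lock holder and a waiter still exist.
  void onSubProcedureFinished(String peerId) {
    ToyLockAndQueue q = queues.get(peerId);
    if (q == null) {
      return;
    }
    boolean idle = q.exclusiveOwner == null && q.waiting.isEmpty();
    if (!guardedCleanup || idle) {
      queues.remove(peerId);
    }
  }

  // Releases the lock and returns the procedures that get woken up.
  List<String> release(String peerId) {
    List<String> woken = new ArrayList<>();
    ToyLockAndQueue q = queues.remove(peerId);
    if (q != null) {
      q.exclusiveOwner = null;
      woken.addAll(q.waiting);
    }
    return woken; // empty if the entry was already cleaned up: the waiter is lost
  }

  // Replays the sequence: A wins the lock, B waits, A's sub procedure finishes, A releases.
  static List<String> simulate(boolean guardedCleanup) {
    LostProcedureSketch s = new LostProcedureSketch(guardedCleanup);
    s.tryAcquire("peer1", "A");
    s.tryAcquire("peer1", "B");
    s.onSubProcedureFinished("peer1");
    return s.release("peer1");
  }

  public static void main(String[] args) {
    System.out.println("unguarded: " + simulate(false)); // prints "unguarded: []"
    System.out.println("guarded: " + simulate(true));    // prints "guarded: [B]"
  }
}
```

In the unguarded run, procedure B is never woken up because the map entry holding its waiting queue is gone by the time A releases the lock. Since the map lives only in memory, rebuilding it from scratch (as a master restart would) is what lets B run again.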