Howard Chu skrev: > [EMAIL PROTECTED] wrote: >> Hm, I hope I have found the race condition that causes this :-) I'm now >> running with the patch at the end to see if that solves it, only time >> will tell..
And we haven't seen any replication problems since this was implemented friday afternoon :-). But it is still a bit too early to conclude that it has been fixed. >> The race is that between the time selecting on the syncrepl socket is >> enabled by the call to connection_client_enable() and the release of the >> si_mutex a new message may arrive. If so, the next call to do_syncrepl >> may fail in its attempt to trylock the mutex and no-one will re-enable >> selecting on it again. My patch delays enabling of the socket until the >> mutex has been released, which looks safe to me. Or can the access to >> si->si_conn without a lock be a problem? > > How about just moving the enable to after the runqueue manipulation is > done? That would still leave a glitch in the window open, unless the initial trylock is changed to a regular lock as your checking message suggests. But, doing that could result in all the threads ending up waiting for the lock if do_syncrepl of a refresh only replication is configured to run too often. I assume the trylock is there to avoid this? What about starting out with deferring the next scheduled call, as a scheduled call is not wanted until the first has completed anyhow? And a rescedule is already being done before do_syncrepl finishes. Deferring should not be necessary if rtask->next_sched.tv_sec is zero, i.e it has already been deferred. That would remove these calls for all the cases when do_syncrepl is called as a result of a new message on a persistent syncrepl connection. > Just need to be sure that do_syncrepl() isn't entered again before > si->si_conn gets initialized. I'm not quite sure what you mean here. As I see it, the problem is that do_syncrepl() must not be called after si->si_conn is enabled and the lock is still held. > It also occurs to me that we probably don't even need to manipulate the > slapd runqueue in persist mode, when si->si_conn is already set. I.e., > in that case we can only have gotten here because of a listener event, > and not because of a runqueue schedule. That is probably true. Rein
