[...]

> It is tricky. Each add_members, remove_members and web CGI post is a
> separate process. If these processes are run sequentially, there should
> not be any problem, because each process will load the list, lock it,
> update it and save it before the next process loads it.
>
> The problem occurs when processes run concurrently. The scenario is:
> process A loads the list unlocked; process B locks the list and updates
> it; process A tries to lock the list and gets the lock after process B
> relinquishes it; if the timestamp on the config.pck from process B's
> update is in the same second as the timestamp of process A's initial
> load, process A thinks the list hasn't been updated and doesn't reload
> it after obtaining the lock. Thus, when process A saves the list,
> process B's changes are reversed.
>
> This is complicated by list caching in the qrunners, because each qrunner
> may have a cached copy of the list, so it can act as process A in the
> above scenario with its cached copy playing the role of the initially
> loaded list. To complicate this further, the qrunners get involved even
> in the simple scenario with sequential commands, because add_members,
> remove_members and the CGIs result in notices being sent, and the qrunner
> processes that send the notices are running concurrently. This is why
> the stress test will fail even though commands are run sequentially.
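To make the quoted scenario concrete, here is a minimal Python sketch of the pattern being described. It is purely illustrative and is not Mailman's actual code: the file name, the flock()-based locking and the ListState class are made up (real Mailman uses its own LockFile module); the only point is the whole-second timestamp comparison in lock().

    import fcntl
    import os
    import pickle

    LIST_PCK = 'config.pck'        # stand-in for lists/<listname>/config.pck
    LOCK_FILE = LIST_PCK + '.lock'

    class ListState:
        """Toy stand-in for a Mailman list object (illustration only)."""

        def __init__(self):
            self._timestamp = 0    # whole-second mtime of the data we hold
            self._lockfp = None
            self.members = {}

        def load(self):
            # Unlocked read, like a CGI, command-line script or qrunner
            # opening the list (or reusing a cached copy).
            with open(LIST_PCK, 'rb') as fp:
                self.members = pickle.load(fp)
            # One-second resolution is the root of the race described above.
            self._timestamp = int(os.path.getmtime(LIST_PCK))

        def lock(self):
            self._lockfp = open(LOCK_FILE, 'w')
            fcntl.flock(self._lockfp, fcntl.LOCK_EX)
            # Reload from disk only if the pickle looks newer than our copy.
            # An update written by another process in the *same second* as
            # our initial load is invisible here, so we keep stale data.
            if int(os.path.getmtime(LIST_PCK)) > self._timestamp:
                self.load()

        def save(self):
            # Writing stale in-memory state back silently reverses whatever
            # the other process changed; then the lock is released.
            with open(LIST_PCK, 'wb') as fp:
                pickle.dump(self.members, fp)
            self._timestamp = int(os.path.getmtime(LIST_PCK))
            fcntl.flock(self._lockfp, fcntl.LOCK_UN)
            self._lockfp.close()

In the failing case, process A calls load(), process B does its own load(), lock() and save() within the same wall-clock second, and A's later lock() then skips the reload, so A's save() quietly drops B's change.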
Thank you for that explanation. I was indeed confused about when the
qrunners cache and/or update these config.pck files and when the
add_members/remove_members commands do so as well. There seemed to be
some sort of conflict between the two.

[...]

>>> The post at
>>> <->
>>> contains a "stress test" that will probably reproduce the problem.
>>
>> Correct. Only one subscriber was subscribed to each test list. Keep in
>> mind that in the stress test given, if you use a sleep counter of 5 with
>> 6 lists, you're waiting _30 seconds_ before the next add_members command
>> is run for that list (I assume the timing issue is per-list, not per run
>> of add_members). Even if you set the timer down to 1, that's a 6-second
>> sleep. This shouldn't affect a cache that we're comparing for the given
>> second. Anyway, my script ran fine with the 5-second sleep (30 seconds
>> per list add), but showed discrepancies with a 3-second sleep.
>
> So you are adding 'sleep' commands after each add_members?

Yes, I was. Without a sleep in between add_members calls, it was failing
for ~50% of the calls to add_members. With a 5-second sleep it would work
most of the time.

> I'm not sure what you're doing. Is there a different test elsewhere in
> the thread?

See my updated stress test that I sent you in my last email (a rough
sketch of that kind of test loop follows below this message).

> I have used a couple of tests as attached. They are the same except for
> list order and are very similar to the one in the original thread. Note
> that they contain only one sleep after all the add_members, just to allow
> things to settle before running list_members.

That makes sense.

>>> I suspect your Mailman server must be very busy for you to see this bug
>>> that frequently. However, it looks like I need to install the fix for
>>> Mailman 2.1.15.
>
> Actually, I don't think the issue is the busy server. I think it is more
> likely that NFS causes timing issues between add_members and VirginRunner
> and OutgoingRunner that just make the bug more likely to trigger.

I think you hit the nail on the head here. It explains a lot.

Thanks,

--
Drew
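For reference, here is a rough Python reconstruction of the kind of stress test discussed above. The actual scripts were attachments to earlier messages and are not reproduced here; the install path, list names and addresses below are made up, and the add_members/list_members options are the usual Mailman 2.1 ones, so check them against your installation.

    import subprocess
    import time

    MAILMAN_BIN = '/usr/lib/mailman/bin'   # assumption: adjust to your install
    LISTS = ['test1', 'test2', 'test3', 'test4', 'test5', 'test6']
    ADDRESSES = ['user%02d@example.com' % i for i in range(20)]
    SLEEP = 3          # seconds to sleep after each add_members call

    # Subscribe each address to each list in turn, one add_members call at a
    # time.  With 6 lists and SLEEP=5 a given list is revisited only every
    # 30 seconds; with SLEEP=3 or no sleep the race is far more likely.
    # The list's default welcome/admin notices still go out, which keeps
    # VirginRunner and OutgoingRunner involved, as described above.
    for addr in ADDRESSES:
        for listname in LISTS:
            subprocess.run(
                [MAILMAN_BIN + '/add_members', '-r', '-', listname],
                input=addr + '\n', text=True, check=True)
            time.sleep(SLEEP)

    # Let things settle, then look for lost subscriptions.
    time.sleep(30)
    for listname in LISTS:
        members = subprocess.run(
            [MAILMAN_BIN + '/list_members', listname],
            capture_output=True, text=True, check=True).stdout.split()
        missing = sorted(set(ADDRESSES) - set(members))
        if missing:
            print('%s is missing %d subscriptions: %s'
                  % (listname, len(missing), ', '.join(missing)))

With no sleep at all, consecutive add_members calls for the same list, plus the qrunners sending the resulting notices, land in the same second much more often, which fits the ~50% failure rate mentioned above.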