Mark, I don't know if you'll find this worth trying to fix, but I revised the stress test this morning to be a bit more stressful. The biggest change I made (besides upping the number of subscribers) was that instead of adding each member to all the lists before adding the second member, it now adds all the members to one list successively before moving on the the next list. This causes a much lower amount of time between subscriptions per list, causing the config.pck files to be updated in a much smaller timeframe.
When I did this I saw that maybe 1-5% of the time a user was still omitted from the list (user was silently removed). I think that because these are processed by the queue runner on a different host and because the timestamp check is being done on an NFS stored file, there is potential that the qrunner for this doesn't yet have an updated mtime for that file (or even a small ntp time drift could cause this). When I commented out the caching part of the code in MailList.py this bug never seems to show up: #if mtime < self.__timestamp: # # File is not newer # return None, None So I think there may still be a race condition here, but the chances of it are unlikely that human interaction would trigger this. If however, you have a script that is subscribing users (one after another), this could still come up. I actually happen to have such a script, but I run it on the same host as the qrunners, so I haven't experienced this before. In my case I think it's probably not worth keeping the performance gain that the caching adds for sake of consistency. Attached is the modified stress test I'm using. Thanks again, -- Drew On Thu, September 29, 2011 4:13 am, Andrew Case wrote: > Mark, > > I realized I hadn't restarted my QRunners after this patch. It looks like > its working perfectly now! Even with a sleep of 0. Thanks so much! > > -- > Drew > > On Thu, September 29, 2011 2:52 am, Andrew Case wrote: >> Thanks Mark, see inline comments. >> >>>> [mailman@myhost] ~/logs |> grep testlist subscribe | grep acase >>>> Sep 28 17:15:14 2011 (4401) testlist: new ac...@example.com, admin >>>> mass >>>> sub >>>> Sep 28 17:19:36 2011 (5821) testlist: deleted ac...@example.com; >>>> member >>>> mgt page >>>> [mailman@myhost] ~/logs |> ../bin/list_members testlist | grep acase >>>> ac...@example.com >>>> [mailman@myhost] ~/logs |> >>> >>> >>> There is a bug in the Mailman 2.1 branch, but the above is not it. The >>> above log shows that ac...@example.com was added by admin mass >>> subscribe >>> at 17:15:14 and then a bit more than 4 minutes later, was removed by >>> checking the unsub box on the admin Membership List and submitting. >> >> I was trying to show that even after the user was removed, they're still >> listed as a member. >> >>> If you check your web server logs, you will find POST transactions to >>> the admin page for both these events. >> >> Agreed. >> >>>> The same also happens when subscribing. I will mass subscribe users >>>> (or >>>> when users confirm subscription via email/web), the logs indicated >>>> that >>>> they have been subscribed successfully, but then when I go look them >>>> up, >>>> they are not listed on the members list. >>>> >>>> This happens sporadically, but I am generally able to reproduce the >>>> error >>>> if I do it a couple times in a row. >>> >>> >>> This is possibly a manifestation of the bug, but I'm surprised it is >>> happening that frequently. >> >> Easiest way for me to replicated the problem is: >> 1) check the unsubscribe box for user A then hit submit >> 2) after reload check the unsubscribe box for user B then hit submit >> 3) reaload the "membership list" page and user B is back on the list >> >> This happens even after I wait a couple seconds in between each step. >> >>>> I'm suspicious there may be a locking issue and config.pck is >>>> reverting >>>> to >>>> config.pck.last. I found this thread rather helpful in analyzing >>>> potential problems, but I have yet to figure anything out: >>>> http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD >>> >>> >>> The thread you point to above is relevant, but it is not a locking >>> issue. The problem is due to list caching in Mailman/Queue/Runner.py >>> and/or nearly concurrent processes which first load the list unlocked >>> and later lock it. The issue is that the resolution of the config.pck >>> timestamp is 1 second, and if a process has a list object and that list >>> object is updated by another process within the same second as the >>> timestamp on the first process's object, the first process won't load >>> the updated list when it locks it. This can result in things like a >>> subscribe being done and logged and then silently reversed. >> >> The result sounds the same, but would this happen even if I'm loading >> the >> page with more than a second in between each step outlined above? >> >>> List locking is working as it should. The issue is that the first >>> process doesn't reload the updated list when it acquires the lock >>> because it thinks it already has the latest version. >>> >>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed >>> it for the now defunct 2.2 branch. >>> >>> A relevant thread starts at >>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html> >>> and continues at >>> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html> >>> >>> The patch in the attached cache.patch file should fix it. >> >> I applied the patch but it doesn't seem to have made a difference. >> >> >>>> In addition if I just run the following commands over and over, then >>>> the >>>> bug never seems to come up. This is part of why I am worrying about >>>> locking: >>>> bin/add_members ... >>>> bin/remove_members ... >>> >>> >>> That won't do it. bin/add_members alone will do it, but only if there >>> is >>> a nearly concurrent process updating the same list. >>> >>> >>>> Is there a good way to test locking between servers? I've run the >>>> tests/test_lockfile.py, but it reports it is OK. >>>> >>>> Any and all help would be GREATLY appreciated. We've been trying to >>>> triage this bug for weeks and it is terribly disruptive for our users. >>> >>> >>> The post at >>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html> >>> contains a "stress test" that will probably reproduce the problem. >> >> Correct. Only one subscriber was subscribed to each test list. Keep in >> mind that in the stress test given if you use a sleep counter of 5 with >> 6 >> lists, that means you're waiting _30 seconds_ before the next add_member >> command is run for that list (I'm assume the timing issue is per-list, >> not >> per run of add_members). Even if you set the timer down to 1 that's a 6 >> second sleep. This shouldn't effect a cache that we're comparing for >> the >> given second. Anyway, my script ran fine with the 5 second sleep (30 >> seconds per list add), but showed discrepancies with a 3 second sleep. >> >>> I suspect your Mailman server must be very busy for you to see this bug >>> that frequently. However, it looks like I need to install the fix for >>> Mailman 2.1.15. >> >> We run about 600 different mailing lists for our department and this has >> been a continues headache. I appreciate all the hard work you guys do. >> >>> It is also curious that the only reports of this that I can recall both >>> come from solaris users. There may be complications in your case due to >>> NFS, but locking shouldn't be the issue. Run the stress test and see if >>> it fails. If it does, try the patch. >> >> Patch didn't seem to help. Is there an easy way to omit the caching in >> this? >> >> Thanks, >> -- >> Drew >> >>> >>> Let us know what happens. >>> >>> -- >>> Mark Sapiro <m...@msapiro.net> The highway is for gamblers, >>> San Francisco Bay Area, California better use your sense - B. Dylan >>> >>> >> >> > > > Andrew Case > Systems Administrator > Courant Institute of Mathematical Sciences > New York University > 251 Mercer St., Room 1023 > New York, NY 10012-1110 > Phone: 212-998-3147 > > Andrew Case Systems Administrator Courant Institute of Mathematical Sciences New York University 251 Mercer St., Room 1023 New York, NY 10012-1110 Phone: 212-998-3147
test_subscribe
Description: Binary data
_______________________________________________ Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers Mailman FAQ: http://wiki.list.org/x/AgA3 Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org Security Policy: http://wiki.list.org/x/QIA9