On 15/10/2010, at 1:23 PM, Thomas Uram wrote:

> This fix also addresses the problem where bridges are not removed
> from the registry. The RegistryPeer also uses the AGXMLRPCServer,
> and relies on the timeout for cleaning up bridges that have timed
> out. I haven't confirmed the fix in this case by testing, but it's
> clearly borne out in the code. I'll test it tomorrow.
Tom,

I just updated our registry machine and bridges are now being removed
correctly. Looks like a good fix all round.

I can't help thinking there are other bits of AG code that would
benefit from a similar fix - the ftps server springs to mind (can't
currently upload data to a venue on a server running with python2.6).

chris

> On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:
>
>> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
>>
>>> This has been fixed. I replicated the problem with a Bridge
>>> running on Ubuntu Lucid, registered against the ANL bridge registry.
>>>
>>> This problem came down to a change in the request handling code in
>>> Python 2.6. The change added a handle_timeout method to
>>> SocketServer.BaseServer, which gets called instead of raising a
>>> socket.timeout exception. The bridge code was relying on this
>>> timeout exception to re-register with the registry. That
>>> functionality has now been moved to the handle_timeout method.
>>>
>>> The change has been committed to the AG code here:
>>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
>>
>> Thanks Tom,
>>
>> Local testing confirms the fix works and I've just uploaded patched
>> AG packages for Ubuntu 10.10 & Slackware 13.1 to their respective
>> repos. Patched packages for other Ubuntu & Slackware versions
>> should appear during today.
>>
>>> The relevant Python report is here:
>>> http://bugs.python.org/issue742598
>>>
>>> This does leave open the question of why the problem couldn't be
>>> replicated in test setups using Python 2.6, as more than one of us
>>> has done.
>>
>> I think there is additional aberrant behaviour under python2.6 in
>> the registry itself which masks the issue fixed by the patch.
>> You'll recall that with the APAG registry, the original fault
>> wasn't seen i.e. bridges didn't disappear.
>> It turns out that bridges aren't being removed at all in this case,
>> even after they have been intentionally stopped, which means
>> non-existent bridges are still being advertised. They can only be
>> removed from the advertised list by restarting the registry. As an
>> example, I had a bridge named SLTest2 registered with the APAG
>> registry. I stopped that bridge over an hour ago and since then the
>> machine has been rebooted twice while making new AG packages for
>> different distros. Yet that same bridge still appears in the bridge
>> list on another machine after a "Purge Bridge Cache". It's disabled
>> and unreachable, so doesn't appear in a user's list under the Tools
>> menu, but it's clearly still being advertised by the registry.
>>
>> chris
>>
>>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>>>
>>>> Chris,
>>>>
>>>> I can confirm that LSU is having to run an older version in order
>>>> for our bridge not to disappear from the ANL registry. I haven't
>>>> had time to figure out why it wasn't staying with our FC13
>>>> installation - so I've had to split the bridge and venueserver for
>>>> the moment until I have time to pick it apart...
>>>> I initially suspected it was a python version issue...
>>>>
>>>> -John Q.
>>>> --
>>>> John I. Quebedeaux, Jr.; Louisiana State University
>>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>>> e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
>>>> phone: 225-578-0062 / fax: 225-578-2597
>>>>
>>>>> From: Christoph Willing <c.will...@uq.edu.au>
>>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>>> To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
>>>>> Cc: "<marcolino.pi...@ac-paris.fr>" <marcolino.pi...@ac-paris.fr>,
>>>>> "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
>>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>>>
>>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>>>
>>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>>>
>>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>>> and successively queried bridges from the registry all day with
>>>>>>> no trouble. Granted, these were all local, but if the problem
>>>>>>> appears as reliably as I've heard, I would have expected to see
>>>>>>> a problem even in this case. We clearly need to narrow down the
>>>>>>> cause of the problem some more. What details do we have about
>>>>>>> the failure cases?
>>>>>>
>>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>>> ago, I was able to replicate the problem and at that time I
>>>>>> thought it may have something to do with newer python versions
>>>>>> (since 2.6 was implicated in another problem I'd seen and the
>>>>>> replicable cases were on newer systems which included python2.6).
>>>>>>
>>>>>> However when I was retesting a Debian lenny system (which uses
>>>>>> python2.5) just the night before last, I also ran a test with the
>>>>>> new Ubuntu maverick (with python2.6). Both ran fine overnight
>>>>>> i.e. maverick seems OK despite using python2.6 (however note that
>>>>>> other tests in France were not successful with maverick, so ....).
>>>>>> Anyway, since maverick had run OK for me, I then started a test
>>>>>> with Ubuntu lucid (also python2.6), one of the systems with which
>>>>>> I'd previously been able to replicate the problem. This time it
>>>>>> has run overnight without any bridge disappearances - I just
>>>>>> tried a bridge cache purge from home and it showed up fine (still
>>>>>> showing up as "LucidTest" in the bridge list if the
>>>>>> www.ap-accessgrid.org registry is enabled).
>>>>>
>>>>> On re-reading this last line, I wondered if the problem has
>>>>> something to do with the registry itself. I guess all the failure
>>>>> instances so far have been using the default ANL registryUrl at
>>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last
>>>>> few days, which produced no failures, all used the APAG
>>>>> registryUrl at www.ap-accessgrid.org/registry/peers.txt.
>>>>> Obviously each points to a different registry so could that be
>>>>> the problem?
>>>>>
>>>>> I spent all day today testing different _recent_ distros
>>>>> (Slackware 13.1, Ubuntu lucid & maverick) against the different
>>>>> registries. In all cases, bridges running against the ANL registry
>>>>> disappeared within 10-15 minutes. In all cases except one (not
>>>>> repeatable), bridges running against the APAG registry did not
>>>>> disappear.
>>>>>
>>>>> My theory therefore is that ANL registry is running with an older
>>>>> version of the AG toolkit that is not compatible with VenueClients
>>>>> running newer AG versions. Tom's recent testing with a separate
>>>>> test registry supports this theory (assuming the test registry is
>>>>> running a recent version of AG toolkit). Philippe's comment that
>>>>> tests with maverick were unsuccessful also supports the theory
>>>>> (assuming those tests used the default ANL registry).
>>>>>
>>>>> Philippe and Tom (and anyone else interested),
>>>>>
>>>>> Could you try running (using the current AG release) a bridge
>>>>> against the APAG registry - some command like:
>>>>>   Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>>>
>>>>> Leave it running for about an hour or two to confirm it does not
>>>>> disappear. Then stop it and run it again, this time against the
>>>>> ANL registry with something like:
>>>>>   Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>>>
>>>>> Look for failure in the first 15 minutes.
>>>>>
>>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>>> disappear? Looking at the list of bridges, the names are becoming
>>>>> very familiar i.e. they've been around a long time. I'm guessing
>>>>> that these bridges are running on older versions of the AG
>>>>> toolkit - still compatible with whatever version is running on the
>>>>> ANL registry machine.
>>>>>
>>>>> Of course, if the test results are in line with the theory, it
>>>>> still doesn't explain the underlying cause. A quick look through
>>>>> bridge & registry related AG code doesn't reveal any recent
>>>>> changes so the real cause may actually be down in some of the
>>>>> supporting software (python, m2crypto anyone?) which are
>>>>> constantly updated in each new Linux release (typically every 6
>>>>> months). If so, this issue will eventually also bite Windows & Mac
>>>>> users as new OS versions introduce up to date versions of python,
>>>>> m2crypto etc. for them too.
>>>>>
>>>>> chris
>>>>>
>>>>>> So we know very little about failure cases;
>>>>>> - there are many in France
>>>>>> - I was previously able to replicate but not now
>>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>>>   problem too
>>>>>>
>>>>>> chris
>>>>>>
>>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I was not there yesterday and it's probably too late to "purge
>>>>>>>> the cache" (there's just a Lucid test by now)
>>>>>>>>
>>>>>>>> In the meantime we decided to switch to Debian because we have
>>>>>>>> a seminar that will be transmitted tomorrow and we really need
>>>>>>>> the bridge to work (in fact to be visible to new users, and
>>>>>>>> there it is).
>>>>>>>>
>>>>>>>> If it also works with "maverick" it is good news for other
>>>>>>>> users in France (but in the first test we made, the bridge
>>>>>>>> disappeared too...)
>>>>>>>>
>>>>>>>> Thanks for everything!!
>>>>>>>>
>>>>>>>> Philippe d'Anfray
>>>>>>>>
>>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>>>>>
>>>>>>>>>>> We're still stuck with this bridge problem, we tried with
>>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>>> can confirm to us that it works fine with Debian, I'll
>>>>>>>>>>> reconfigure our server and install Debian.
>>>>>>>>>>
>>>>>>>>>> I'm just about to leave for a short holiday so I can't
>>>>>>>>>> reconfirm that Debian still works correctly until late next
>>>>>>>>>> week.
>>>>>>>>>
>>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>>> which has been running for over 4.5 hours - also no problem
>>>>>>>>> yet.
>>>>>>>>> I will let them both run overnight here (your day time) and
>>>>>>>>> you can check whether they're still running OK if you purge
>>>>>>>>> your bridge cache (assuming you have www.ap-accessgrid.org as
>>>>>>>>> one of your bridge registries) and look for the bridges named
>>>>>>>>> DebTest (Debian lenny 64bit) and MaverickTest (Ubuntu maverick
>>>>>>>>> 32bit).
>>>>>>
>>>>>> Christoph Willing                       +61 7 3365 8316
>>>>>> QCIF Access Grid Manager
>>>>>> University of Queensland

Christoph Willing                       +61 7 3365 8316
QCIF Access Grid Manager
University of Queensland
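
For anyone reading this thread later, the behaviour change Tom describes (handle_timeout being called instead of a socket.timeout exception being raised) can be illustrated with a small sketch. This is not the code from changeset 6820. The class name IdleAwareServer, the on_idle callback and the serve_once helper are all hypothetical names made up for illustration, and the sketch is written against the Python 3 module name socketserver (SocketServer in Python 2.x).

```python
import socketserver  # this module is called SocketServer in Python 2.x


class IdleAwareServer(socketserver.TCPServer):
    """Hypothetical server that runs periodic work whenever it sits idle."""

    timeout = 0.1  # seconds handle_request() waits for a request
    allow_reuse_address = True

    def __init__(self, address, handler, on_idle):
        socketserver.TCPServer.__init__(self, address, handler)
        # Hypothetical callback, e.g. "re-register this bridge with the
        # registry" or "drop bridges that have timed out".
        self.on_idle = on_idle

    def handle_timeout(self):
        # From Python 2.6 onwards, handle_request() calls this hook when
        # the timeout expires, so code that waited for a socket.timeout
        # exception around handle_request() never gets to run.
        self.on_idle()


def serve_once(server):
    """Wait for a single request; idle work happens in handle_timeout()."""
    server.handle_request()
```

Calling serve_once() on a server that receives no connection within the timeout should invoke the on_idle callback once. Putting the periodic work in handle_timeout, rather than in an except clause, is the pattern the fix in this thread moved to.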