Great on the updates. - JQ
> From: Christoph Willing <c.will...@uq.edu.au>
> Date: Fri, 15 Oct 2010 11:47:12 +1000
> To: "Thomas D. Uram" <tu...@mcs.anl.gov>
> Cc: John Quebedeaux <jo...@lsu.edu>, Philippe d'Anfray
> <philippe.d-anf...@cea.fr>, "<marcolino.pi...@ac-paris.fr>"
> <marcolino.pi...@ac-paris.fr>, "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
> Subject: Re: [AG-TECH] Vanishing Bridges
>
>
> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
>
>> This has been fixed. I replicated the problem with a Bridge running
>> on Ubuntu Lucid, registered against the ANL bridge registry.
>>
>> This problem came down to a change in the request handling code in
>> Python 2.6. The change added a handle_timeout method to
>> SocketServer.BaseServer, which gets called instead of raising a
>> socket.timeout exception. The bridge code was relying on this
>> timeout exception to re-register with the registry. That
>> functionality has now been moved to the handle_timeout method.
>>
>> The change has been committed to the AG code here:
>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
>
>
> Thanks Tom,
>
> Local testing confirms the fix works and I've just uploaded patched AG
> packages for Ubuntu 10.10 & Slackware 13.1 to their respective repos.
> Patched packages for other Ubuntu & Slackware versions should appear
> during today.
>
>
>> The relevant Python report is here:
>> http://bugs.python.org/issue742598
>>
>> This does leave open the question of why the problem couldn't be
>> replicated in test setups using Python 2.6, as more than one of us
>> has done.
>
> I think there is additional aberrant behaviour under python2.6 in the
> registry itself which masks the issue fixed by the patch. You'll
> recall that with the APAG registry, the original fault wasn't seen
> i.e. bridges didn't disappear. It turns out that bridges aren't being
> removed at all in this case, even after they have been intentionally
> stopped, which means non-existent bridges are still being advertised.
> They can only be removed from the advertised list by restarting the
> registry. As an example, I had a bridge named SLTest2 registered with
> the APAG registry. I stopped that bridge over an hour ago and since
> then the machine has been rebooted twice while making new AG packages
> for different distros. Yet that same bridge still appears in the
> bridge list on another machine after a "Purge Bridge Cache". It's
> disabled and unreachable, so doesn't appear in a user's list under the
> Tools menu, but it's clearly still being advertised by the registry.
>
>
> chris
>
>
>
>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>>
>>> Chris,
>>>
>>> I can confirm that LSU is having to run an older version in order
>>> for our bridge not to disappear from the ANL registry. I haven't had
>>> time to figure out why it wasn't staying with our FC13 installation -
>>> so I've had to split the bridge and venueserver for the moment until
>>> I have time to pick it apart...
>>> I initially suspected it was a python version issue...
>>>
>>> -John Q.
>>> --
>>> John I. Quebedeaux, Jr.; Louisiana State University
>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>> e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
>>> phone: 225-578-0062 / fax: 225-578-2597
>>>
>>>
>>>> From: Christoph Willing <c.will...@uq.edu.au>
>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>> To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
>>>> Cc: "<marcolino.pi...@ac-paris.fr>" <marcolino.pi...@ac-paris.fr>,
>>>> "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>>
>>>>
>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>>
>>>>>
>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>>
>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>> and successively queried bridges from the registry all day with no
>>>>>> trouble.
>>>>>> Granted, these were all local, but if the problem appears
>>>>>> as reliably as I've heard, I would have expected to see a problem
>>>>>> even in this case. We clearly need to narrow down the cause of the
>>>>>> problem some more. What details do we have about the failure
>>>>>> cases?
>>>>>
>>>>>
>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>> ago, I was able to replicate the problem and at that time I thought
>>>>> it may have something to do with newer python versions (since 2.6
>>>>> was implicated in another problem I'd seen and the replicable cases
>>>>> were on newer systems which included python2.6).
>>>>>
>>>>> However when I was retesting a Debian lenny system (which uses
>>>>> python2.5) just night before last, I also ran a test with the new
>>>>> Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
>>>>> maverick seems OK despite using python2.6 (however note that other
>>>>> tests in France were not successful with maverick, so ....).
>>>>> Anyway, since maverick had run OK for me, I then started a test with
>>>>> Ubuntu lucid (also python2.6), one of the systems with which I'd
>>>>> previously been able to replicate the problem. This time it has run
>>>>> overnight without any bridge disappearances - I just tried a bridge
>>>>> cache purge from home and it showed up fine (still showing up as
>>>>> "LucidTest" in the bridge list if the www.ap-accessgrid.org registry
>>>>> is enabled).
>>>>
>>>>
>>>> On re-reading this last line, I wondered if the problem has something
>>>> to do with the registry itself. I guess all the failure instances so
>>>> far have been using the default ANL registryUrl at
>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
>>>> days, which produced no failures, all used the APAG registryUrl at
>>>> www.ap-accessgrid.org/registry/peers.txt.
>>>> Obviously each points to a different registry so could that be the
>>>> problem?
>>>>
>>>> I spent all day today testing different _recent_ distros (Slackware
>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In
>>>> all cases, bridges running against the ANL registry disappeared
>>>> within 10-15 minutes. In all cases except one (not repeatable),
>>>> bridges running against the APAG registry did not disappear.
>>>>
>>>> My theory therefore is that ANL registry is running with an older
>>>> version of the AG toolkit that is not compatible with VenueClients
>>>> running newer AG versions. Tom's recent testing with a separate test
>>>> registry supports this theory (assuming the test registry is
>>>> running a recent version of AG toolkit). Philippe's comment that
>>>> tests with maverick were unsuccessful also supports the theory
>>>> (assuming those tests used the default ANL registry).
>>>>
>>>>
>>>> Philippe and Tom (and anyone else interested),
>>>>
>>>> Could you try running (using the current AG release) a bridge
>>>> against the APAG registry - some command like:
>>>>     Bridge3.py --name=Testing123 --location=wherever
>>>>         --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>>
>>>> Leave it running for about an hour or two to confirm it does not
>>>> disappear. Then stop it and run it again, this time against the ANL
>>>> registry with something like:
>>>>     Bridge3.py --name=TestingXYZ --location=wherever
>>>>         --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>>
>>>> Look for failure in the first 15 minutes.
>>>>
>>>>
>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>> disappear? Looking at the list of bridges, the names are becoming
>>>> very familiar i.e. they've been around a long time. I'm guessing
>>>> that these bridges are running on older versions of the AG toolkit -
>>>> still compatible with whatever version is running on the ANL
>>>> registry machine.
>>>>
>>>>
>>>> Of course, if the test results are in line with the theory, it still
>>>> doesn't explain the underlying cause. A quick look through bridge &
>>>> registry related AG code doesn't reveal any recent changes so the
>>>> real cause may actually be down in some of the supporting software
>>>> (python, m2crypto anyone?) which are constantly updated in each new
>>>> Linux release (typically every 6 months). If so, this issue will
>>>> eventually also bite Windows & Mac users as new OS versions
>>>> introduce up to date versions of python, m2crypto etc. for them too.
>>>>
>>>>
>>>> chris
>>>>
>>>>
>>>>> So we know very little about failure cases;
>>>>> - there are many in France
>>>>> - I was previously able to replicate but not now
>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>> problem too
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I was not there yesterday and it's probably too late to "purge
>>>>>>> the cache" (there's just a Lucid test by now)
>>>>>>>
>>>>>>> By the time we decided to switch to debian because we have a
>>>>>>> seminar that will be transmitted
>>>>>>> tomorrow and really need the bridge to work (in fact to be
>>>>>>> visible to new users and there it is).
>>>>>>>
>>>>>>> If it works also with "maverick" it is a good news for other
>>>>>>> users in France (but in the first test we made the
>>>>>>> bridge disappears too...)
>>>>>>>
>>>>>>> Thanks for everything!!
>>>>>>>
>>>>>>>
>>>>>>> Philippe d'Anfray
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We're still stuck with this bridge problem, we tried with
>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>> can confirm us that it works fine with Debian, I'll
>>>>>>>>>> reconfigure our server and install a Debian.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm just about to leave for a short holiday so I can't
>>>>>>>>> reconfirm that Debian still works correctly until late next
>>>>>>>>> week.
>>>>>>>>
>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>>>>> I will let them both run overnight here (your day time) and you
>>>>>>>> can check whether they're still running OK if you purge your
>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>>>>> your bridge registries) and look for the bridges named DebTest
>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>>>>>
>>>>>>> <Philippe_d-Anfray.vcf>
>>>>>>
>>>>>
>>>>> Christoph Willing +61 7 3365 8316
>>>>> QCIF Access Grid Manager
>>>>> University of Queensland
>>>>>
>>>>
>>>> Christoph Willing +61 7 3365 8316
>>>> QCIF Access Grid Manager
>>>> University of Queensland
>>>>
>>>
>>
>
> Christoph Willing +61 7 3365 8316
> QCIF Access Grid Manager
> University of Queensland
>
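
For anyone reading this thread later, here is a minimal sketch of the behaviour Tom describes: when a server with a timeout set sits idle in handle_request(), the handle_timeout method is called and no socket.timeout exception is raised, so periodic work such as re-registering has to live in handle_timeout. This is not the code from changeset 6820. It assumes Python 3, where the SocketServer module is named socketserver, and the names EchoHandler, ReRegisteringServer, register and run_idle_cycles are hypothetical placeholders invented for illustration.

```python
import socketserver


class EchoHandler(socketserver.BaseRequestHandler):
    """Placeholder request handler; the real bridge handler is not shown in the thread."""

    def handle(self):
        data = self.request.recv(1024)
        self.request.sendall(data)


class ReRegisteringServer(socketserver.TCPServer):
    """Server that refreshes its registration whenever handle_request() times out."""

    timeout = 0.1  # seconds to wait for a request before handle_timeout() is called
    allow_reuse_address = True

    def __init__(self, address, handler):
        super().__init__(address, handler)
        self.registrations = 0

    def register(self):
        # Hypothetical stand-in for contacting the bridge registry.
        self.registrations += 1

    def handle_timeout(self):
        # Called by handle_request() when no request arrives within `timeout`.
        # No socket.timeout exception is raised in that case.
        self.register()


def run_idle_cycles(cycles):
    """Run `cycles` idle handle_request() calls and return the registration count."""
    # Port 0 asks the OS for any free port, so the sketch does not clash with real services.
    with ReRegisteringServer(("127.0.0.1", 0), EchoHandler) as server:
        for _ in range(cycles):
            server.handle_request()  # returns after `timeout` with no exception
        return server.registrations
```

Calling run_idle_cycles(3) with no clients connecting should count one registration per idle cycle. Code that instead wraps handle_request() in a try/except for socket.timeout would never re-register under this behaviour, which matches the symptom of bridges dropping off the registry.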