Chris, I can confirm that LSU is having to run an older version to keep our bridge from disappearing from the ANL registry. I haven't had time to figure out why it wasn't staying registered with our FC13 installation, so I've had to split the bridge and venueserver for the moment until I have time to pick it apart... I initially suspected it was a python version issue...
-John Q.

--
John I. Quebedeaux, Jr.; Louisiana State University
Computer Manager LBRN; 131 Life Sciences Bldg.
e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
phone: 225-578-0062 / fax: 225-578-2597

> From: Christoph Willing <c.will...@uq.edu.au>
> Date: Thu, 14 Oct 2010 21:09:12 +1000
> To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
> Cc: "<marcolino.pi...@ac-paris.fr>" <marcolino.pi...@ac-paris.fr>,
> "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
> Subject: Re: [AG-TECH] Vanishing Bridges
>
> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>
>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>
>>> Last week I set up a test registry, registered a bridge with it,
>>> and successively queried bridges from the registry all day with no
>>> trouble. Granted, these were all local, but if the problem appears
>>> as reliably as I've heard, I would have expected to see a problem
>>> even in this case. We clearly need to narrow down the cause of the
>>> problem some more. What details do we have about the failure cases?
>>
>> We have very few details, unfortunately. I recall, nearly a year
>> ago, I was able to replicate the problem and at that time I thought
>> it may have something to do with newer python versions (since 2.6
>> was implicated in another problem I'd seen and the replicable cases
>> were on newer systems which included python2.6).
>>
>> However when I was retesting a Debian lenny system (which uses
>> python2.5) just the night before last, I also ran a test with the new
>> Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
>> maverick seems OK despite using python2.6 (however note that other
>> tests in France were not successful with maverick, so ....). Anyway,
>> since maverick had run OK for me, I then started a test with Ubuntu
>> lucid (also python2.6), one of the systems with which I'd previously
>> been able to replicate the problem. This time it has run overnight
>> without any bridge disappearances - I just tried a bridge cache
>> purge from home and it showed up fine (still showing up as
>> "LucidTest" in the bridge list if the www.ap-accessgrid.org
>> registry is enabled).
>
> On re-reading this last line, I wondered if the problem has something
> to do with the registry itself. I guess all the failure instances so
> far have been using the default ANL registryUrl at
> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
> days, which produced no failures, all used the APAG registryUrl at
> www.ap-accessgrid.org/registry/peers.txt.
> Obviously each points to a different registry so could that be the
> problem?
>
> I spent all day today testing different _recent_ distros (Slackware
> 13.1, Ubuntu lucid & maverick) against the different registries. In
> all cases, bridges running against the ANL registry disappeared within
> 10-15 minutes. In all cases except one (not repeatable), bridges
> running against the APAG registry did not disappear.
>
> My theory therefore is that the ANL registry is running with an older
> version of the AG toolkit that is not compatible with VenueClients
> running newer AG versions. Tom's recent testing with a separate test
> registry supports this theory (assuming the test registry is running a
> recent version of AG toolkit). Philippe's comment that tests with
> maverick were unsuccessful also supports the theory (assuming those
> tests used the default ANL registry).
>
> Philippe and Tom (and anyone else interested),
>
> Could you try running (using the current AG release) a bridge against
> the APAG registry - some command like:
>
>   Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>
> Leave it running for about an hour or two to confirm it does not
> disappear. Then stop it and run it again, this time against the ANL
> registry with something like:
>
>   Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt
>
> Look for failure in the first 15 minutes.
>
> If the fault is in the ANL registry, why do so many bridges _not_
> disappear? Looking at the list of bridges, the names are becoming very
> familiar i.e. they've been around a long time. I'm guessing that these
> bridges are running on older versions of the AG toolkit - still
> compatible with whatever version is running on the ANL registry machine.
>
> Of course, if the test results are in line with the theory, it still
> doesn't explain the underlying cause. A quick look through bridge &
> registry related AG code doesn't reveal any recent changes so the real
> cause may actually be down in some of the supporting software (python,
> m2crypto anyone?) which are constantly updated in each new Linux
> release (typically every 6 months). If so, this issue will eventually
> also bite Windows & Mac users as new OS versions introduce up to date
> versions of python, m2crypto etc. for them too.
>
> chris
>
>> So we know very little about failure cases;
>> - there are many in France
>> - I was previously able to replicate but not now
>> - I _think_ I recall that Todd Z reported that he had seen the
>>   problem too
>>
>> chris
>>
>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>
>>>> Hello,
>>>>
>>>> I was not there yesterday and it's probably too late to "purge the
>>>> cache" (there's just a Lucid test by now)
>>>>
>>>> In the meantime we decided to switch to Debian because we have a
>>>> seminar that will be transmitted tomorrow and really need the
>>>> bridge to work (in fact to be visible to new users, and there it is).
>>>>
>>>> If it also works with "maverick" that is good news for other users
>>>> in France (but in the first test we made the bridge disappeared
>>>> too...)
>>>>
>>>> Thanks for everything!!
>>>>
>>>> Philippe d'Anfray
>>>>
>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>
>>>>>>> We're still stuck with this bridge problem, we tried with
>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>> can confirm that it works fine with Debian, I'll reconfigure
>>>>>>> our server and install Debian.
>>>>>>
>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>> that Debian still works correctly until late next week.
>>>>>
>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>> I will let them both run overnight here (your day time) and you
>>>>> can check whether they're still running OK if you purge your
>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>> your bridge registries) and look for the bridges named DebTest
>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>
>> Christoph Willing +61 7 3365 8316
>> QCIF Access Grid Manager
>> University of Queensland
>
> Christoph Willing +61 7 3365 8316
> QCIF Access Grid Manager
> University of Queensland
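
For anyone repeating the test Chris describes in the quoted message (start a bridge against one registry, then see whether it drops out of the bridge list within the first 15 minutes or so), the watching part could be automated along these lines. This is only a rough sketch, not anything from the AG toolkit: `watch_bridge` and `list_bridges` are hypothetical names, and `list_bridges` is a function you would have to supply yourself to return the bridge names currently visible from the registry under test. The thread does not say how to query a registry programmatically, so that part is deliberately left out.

```python
import time
from typing import Callable, Iterable, Optional


def watch_bridge(
    name: str,
    list_bridges: Callable[[], Iterable[str]],
    checks: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll a bridge list and report when a named bridge goes missing.

    list_bridges is a caller-supplied (hypothetical) function returning
    the bridge names currently visible from the registry being tested.

    Returns the number of seconds between the start of the watch and the
    first poll in which `name` was absent, or None if the bridge was
    present in all `checks` polls.
    """
    start = clock()
    for _ in range(checks):
        if name not in set(list_bridges()):
            return clock() - start
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # Canned snapshots standing in for real registry queries (assumption:
    # the real lookup would replace this lambda).
    snapshots = iter([["TestingXYZ", "Other"], ["TestingXYZ"], ["Other"]])
    gone_after = watch_bridge(
        "TestingXYZ", lambda: next(snapshots), checks=3, interval_s=0.1
    )
    if gone_after is None:
        print("bridge stayed visible for the whole watch")
    else:
        print("bridge went missing after about %.1f s" % gone_after)
```

Polling once a minute for 20 checks would cover the 10-15 minute window in which the ANL-registered bridges were seen to vanish; a longer run (an hour or two) would match the APAG control test. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.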