Hi Andrew,


I was going to push out the fix to the Fedora RPMs once I had new packages 
working with Fedora 14, so that I have the same source RPM on all versions. 
It might be in the yum repo tomorrow.





Doug



From: ag-tech-boun...@lists.mcs.anl.gov 
[mailto:ag-tech-boun...@lists.mcs.anl.gov] On Behalf Of Andrew Ford
Sent: Thursday, 4 November 2010 2:19 AM
To: Chris Willing
Cc: <marcolino.pi...@ac-paris.fr>; ag-t...@mcs.anl.gov
Subject: Re: [AG-TECH] Vanishing Bridges



Hi Chris,

Has this fix been pushed to all the different distribution repos? I'm still 
getting the issue (bridge registers, accessible to clients, then apparently 
disappears from registry after a few minutes) with bridges run on Ubuntu 9.04 
and Fedora 13, and trying to run Bridge3.py on an Ubuntu 10.10 box just hangs.

--Andrew

2010/10/14 Christoph Willing <c.will...@uq.edu.au>


On 15/10/2010, at 1:23 PM, Thomas Uram wrote:

This fix also addresses the problem where bridges are not removed from the 
registry. The RegistryPeer also uses the AGXMLRPCServer, and relies on the 
timeout for cleaning up bridges that have timed out. I haven't confirmed the 
fix in this case by testing, but it's clearly borne out in the code. I'll test 
it tomorrow.
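
To illustrate the cleanup described above, here is a minimal sketch of a
registry that drops bridges once their last registration is older than a
time-to-live. It is not the AG RegistryPeer code: the class BridgeRegistry,
its method names and the TTL value used with it are all hypothetical, chosen
only for illustration.

```python
class BridgeRegistry:
    """Hypothetical registry sketch; not the AG RegistryPeer class."""

    def __init__(self, ttl_seconds):
        # Assumed time-to-live: how long a registration stays valid.
        self.ttl_seconds = ttl_seconds
        self._last_seen = {}  # bridge name -> time of last registration

    def register(self, name, now):
        """Record (or refresh) a bridge registration at time `now`."""
        self._last_seen[name] = now

    def purge_expired(self, now):
        """Drop bridges whose last registration is older than the TTL."""
        expired = [name for name, seen in self._last_seen.items()
                   if now - seen > self.ttl_seconds]
        for name in expired:
            del self._last_seen[name]
        return sorted(expired)

    def advertised(self):
        """Names currently advertised to clients, in sorted order."""
        return sorted(self._last_seen)
```

If purge_expired() is only ever called from the timeout path and that path
stops firing, bridges that have been stopped simply stay in the advertised
list.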



Tom,

I just updated our registry machine and bridges are now being removed 
correctly. Looks like a good fix all round.

I can't help thinking there are other bits of AG code that would benefit from a 
similar fix - the ftps server springs to mind (can't currently upload data to a 
venue on a server running with python2.6).




chris



On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:


On 15/10/2010, at 6:10 AM, Thomas Uram wrote:

This has been fixed. I replicated the problem with a Bridge running on Ubuntu 
Lucid, registered against the ANL bridge registry.

This problem came down to a change in the request handling code in Python 2.6. 
The change added a handle_timeout method to SocketServer.BaseServer, which gets 
called instead of raising a socket.timeout exception. The bridge code was 
relying on this timeout exception to re-register with the registry. That 
functionality has now been moved to the handle_timeout method.
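
A minimal sketch of that behaviour, assuming Python 3 (where the module is
named socketserver rather than SocketServer). ReRegisteringServer and
serve_idle are hypothetical names, and the counter merely stands in for the
real re-registration call; this is not the code from the changeset.

```python
import socket
import socketserver  # named SocketServer in Python 2.6


class ReRegisteringServer(socketserver.TCPServer):
    """Hypothetical stand-in for a bridge's server; not the AG class."""

    timeout = 0.05  # seconds handle_request() waits for a request
    allow_reuse_address = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.registrations = 0  # stands in for "re-register with registry"

    def handle_timeout(self):
        # Called by handle_request() when no request arrives within
        # self.timeout. Periodic work has to live here, because no
        # socket.timeout exception is raised to the caller.
        self.registrations += 1


def serve_idle(server, rounds):
    """Call handle_request() `rounds` times with no clients connecting.

    Returns True if a socket.timeout exception ever reached the caller,
    which is what the old bridge code was waiting for.
    """
    saw_exception = False
    for _ in range(rounds):
        try:
            server.handle_request()
        except socket.timeout:
            saw_exception = True
    return saw_exception
```

Run against an idle server, the except branch is never taken, so any
re-registration placed there would silently stop happening, while
handle_timeout runs once per idle handle_request() call.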

The change has been committed to the AG code here:
https://trac.ci.uchicago.edu/accessgrid/changeset/6820



Thanks Tom,

Local testing confirms the fix works and I've just uploaded patched AG packages 
for Ubuntu 10.10 & Slackware 13.1 to their respective repos. Patched packages 
for other Ubuntu & Slackware versions should appear later today.



The relevant Python report is here:
http://bugs.python.org/issue742598

This does leave open the question of why the problem couldn't be replicated in 
test setups using Python 2.6, as more than one of us has done.


I think there is additional aberrant behaviour under python2.6 in the registry 
itself which masks the issue fixed by the patch. You'll recall that with the 
APAG registry, the original fault wasn't seen i.e. bridges didn't disappear. It 
turns out that bridges aren't being removed at all in this case, even after 
they have been intentionally stopped, which means non-existent bridges are 
still being advertised. They can only be removed from the advertised list by 
restarting the registry. As an example, I had a bridge named SLTest2 registered 
with the APAG registry. I stopped that bridge over an hour ago and since then 
the machine has been rebooted twice while making new AG packages for different 
distros. Yet that same bridge still appears in the bridge list on another 
machine after a "Purge Bridge Cache". It's disabled and unreachable, so doesn't 
appear in a user's list under the Tools menu, but it's clearly still being 
advertised by the registry.


chris




On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:

Chris,

I can confirm that LSU is having to run an older version in order for our
bridge not to disappear from the ANL registry. I haven't had time to figure
out why it wasn't staying with our FC13 installation - so I've had to split
the bridge and venueserver for the moment until I have time to pick it apart...
I initially suspected it was a python version issue...

-John Q.
--
John I. Quebedeaux, Jr.; Louisiana State University
Computer Manager LBRN; 131 Life Sciences Bldg.
e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
phone: 225-578-0062 / fax: 225-578-2597



From: Christoph Willing <c.will...@uq.edu.au>
Date: Thu, 14 Oct 2010 21:09:12 +1000
To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
Cc: <marcolino.pi...@ac-paris.fr>, <ag-t...@mcs.anl.gov>
Subject: Re: [AG-TECH] Vanishing Bridges


On 14/10/2010, at 7:12 AM, Christoph Willing wrote:


On 14/10/2010, at 2:13 AM, Thomas Uram wrote:

Last week I set up a test registry, registered a bridge with it,
and successively queried bridges from the registry all day with no
trouble. Granted, these were all local, but if the problem appears
as reliably as I've heard, I would have expected to see a problem
even in this case. We clearly need to narrow down the cause of the
problem some more. What details do we have about the failure cases?



We have very few details, unfortunately. I recall, nearly a year
ago, I was able to replicate the problem and at that time I thought
it may have something to do with newer python versions (since 2.6
was implicated in another problem I'd seen and the replicable cases
were on newer systems which included python2.6).

However when I was retesting a Debian lenny system (which uses
python2.5) just the night before last, I also ran a test with the new
Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
maverick seems OK despite using python2.6 (however note that other
tests in France were not successful with maverick, so ....). Anyway,
since maverick had run OK for me, I then started a test with Ubuntu
lucid (also python2.6), one of the systems with which I'd previously
been able to replicate the problem. This time it has run overnight
without any bridge disappearances - I just tried a bridge cache
purge from home and it showed up fine (still showing up as
"LucidTest" in the bridge list if the www.ap-accessgrid.org
registry is enabled).



On re-reading this last line, I wondered if the problem has something
to do with the registry itself. I guess all the failure instances so
far have been using the default ANL registryUrl at
www.accessgrid.org/registry/peers.txt, whereas my tests the last few
days, which produced no failures, all used the APAG registryUrl at
www.ap-accessgrid.org/registry/peers.txt.
Obviously each points to a different registry so could that be the
problem?

I spent all day today testing different _recent_ distros (Slackware
13.1, Ubuntu lucid & maverick) against the different registries. In
all cases, bridges running against the ANL registry disappeared within
10-15 minutes. In all cases except one (not repeatable), bridges
running against the APAG registry did not disappear.

My theory therefore is that the ANL registry is running with an older
version of the AG toolkit that is not compatible with VenueClients
running newer AG versions. Tom's recent testing with a separate test
registry supports this theory (assuming the test registry is running a
recent version of AG toolkit). Philippe's comment that tests with
maverick were unsuccessful also supports the theory (assuming those
tests used the default ANL registry).


Philippe and Tom (and anyone else interested),

Could you try running (using the current AG release) a bridge against
the APAG registry - some command like:
 Bridge3.py --name=Testing123 --location=wherever \
   --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt

Leave it running for about an hour or two to confirm it does not
disappear. Then stop it and run it again, this time against the ANL
registry with something like:
 Bridge3.py --name=TestingXYZ --location=wherever \
   --registryUrl=http://www.accessgrid.org/registry/peers.txt

Look for failure in the first 15 minutes.


If the fault is in the ANL registry, why do so many bridges _not_
disappear? Looking at the list of bridges, the names are becoming very
familiar i.e. they've been around a long time. I'm guessing that these
bridges are running on older versions of the AG toolkit - still
compatible with whatever version is running on the ANL registry machine.


Of course, if the test results are in line with the theory, it still
doesn't explain the underlying cause. A quick look through bridge &
registry related AG code doesn't reveal any recent changes so the real
cause may actually be down in some of the supporting software (python,
m2crypto anyone?) which are constantly updated in each new Linux
release (typically every 6 months). If so, this issue will eventually
also bite Windows & Mac users as new OS versions introduce up to date
versions of python, m2crypto etc. for them too.


chris



So we know very little about failure cases:
- there are many in France
- I was previously able to replicate but not now
- I _think_ I recall that Todd Z reported that he had seen the
problem too

chris



On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:

Hello,

I was not there yesterday and it's probably too late to "purge the
cache" (there's just a Lucid test by now)

In the meantime we decided to switch to Debian because we have a
seminar that will be transmitted tomorrow and we really need the
bridge to work (in fact, to be visible to new users, and there it is).

If it also works with "maverick", that is good news for other users
in France (but in the first test we made, the bridge disappeared
too...)

Thanks for everything!!


Philippe d'Anfray




On 12/10/2010 12:56, Christoph Willing wrote:




      We're still stuck with this bridge problem; we tried with
      Ubuntu 10.10 this afternoon but it is still the same. If you
      can confirm for us that it works fine with Debian, I'll
      reconfigure our server and install Debian.



   I'm just about to leave for a short holiday so I can't reconfirm
   that Debian still works correctly until late next week.


   I'm now running a test bridge with Debian "lenny". It has been
   running nearly 5 hours without any problem so far. I'm also
   running another test bridge using the new Ubuntu "maverick",
   which has been running for over 4.5 hours - also no problem yet.
   I will let them both run overnight here (your day time) and you
   can check whether they're still running OK if you purge your
   bridge cache (assuming you have www.ap-accessgrid.org as one of
   your bridge registries) and look for the bridges named DebTest
   (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).




   Christoph Willing                       +61 7 3365 8316
   QCIF Access Grid Manager
   University of Queensland




