Hi Chris,

Has this fix been pushed to all the different distribution repos? I'm still 
getting the issue (bridge registers, accessible to clients, then apparently 
disappears from registry after a few minutes) with bridges run on Ubuntu 9.04 
and Fedora 13, and trying to run Bridge3.py on an Ubuntu 10.10 box just hangs.

--Andrew


2010/10/14 Christoph Willing <c.will...@uq.edu.au>



   On 15/10/2010, at 1:23 PM, Thomas Uram wrote:



      This fix also addresses the problem where bridges are not removed from 
the registry. The RegistryPeer also uses the AGXMLRPCServer, and relies on the 
timeout for cleaning up bridges that have timed out. I haven't confirmed the 
fix in this case by testing, but it's clearly borne out in the code. I'll test 
it tomorrow.
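A minimal sketch of the timeout-driven cleanup described above, for readers following along. This is not the RegistryPeer code: `BridgeRegistry`, `ttl_seconds`, `register`, `purge_expired` and `active_bridges` are hypothetical names chosen for illustration, and the 120-second lifetime is an arbitrary assumption. The point is only that stale entries leave the list when something calls the purge, so if the server's timeout hook never fires, nothing removes them.

```python
import time


class BridgeRegistry:
    """Hypothetical registry that forgets bridges which stop re-registering."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.last_seen = {}  # bridge name -> time of its last registration

    def register(self, name, now=None):
        # A bridge calls this periodically to stay in the list.
        self.last_seen[name] = time.time() if now is None else now

    def purge_expired(self, now=None):
        # Meant to be called from the server's timeout hook.
        now = time.time() if now is None else now
        expired = [name for name, seen in self.last_seen.items()
                   if now - seen > self.ttl_seconds]
        for name in expired:
            del self.last_seen[name]
        return expired

    def active_bridges(self):
        return sorted(self.last_seen)


# Explicit timestamps keep the example deterministic.
registry = BridgeRegistry(ttl_seconds=120)  # assumed lifetime
registry.register("DebTest", now=0)
registry.register("LucidTest", now=100)
removed = registry.purge_expired(now=150)   # DebTest was last seen 150s ago
print(removed, registry.active_bridges())
```

Passing `now` explicitly is only a testing convenience; a real registry would rely on the clock.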




   Tom,

   I just updated our registry machine and bridges are now being removed 
correctly. Looks like a good fix all round.

   I can't help thinking there are other bits of AG code that would benefit 
from a similar fix - the ftps server springs to mind (can't currently upload 
data to a venue on a server running with python2.6).



   chris




      On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:




         On 15/10/2010, at 6:10 AM, Thomas Uram wrote:



            This has been fixed. I replicated the problem with a Bridge running 
on Ubuntu Lucid, registered against the ANL bridge registry.

            This problem came down to a change in the request handling code in 
Python 2.6. The change added a handle_timeout method to 
SocketServer.BaseServer, which gets called instead of raising a socket.timeout 
exception. The bridge code was relying on this timeout exception to re-register 
with the registry. That functionality has now been moved to the handle_timeout 
method.

            The change has been committed to the AG code here:
            https://trac.ci.uchicago.edu/accessgrid/changeset/6820
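For anyone wanting to see the behaviour in isolation, here is a small sketch of the pattern described above. It is not the code from the changeset: `ReRegisteringServer` and `register_with_registry` are hypothetical names, the 0.2 second timeout is arbitrary, and it uses Python 3 module names (`socketserver` and `xmlrpc.server`) rather than the Python 2.6 ones. It assumes a local socket can be bound on 127.0.0.1.

```python
from xmlrpc.server import SimpleXMLRPCServer


class ReRegisteringServer(SimpleXMLRPCServer):
    """Hypothetical stand-in for the bridge's XML-RPC server."""

    timeout = 0.2  # seconds handle_request() waits for a request

    def __init__(self, address):
        super().__init__(address, logRequests=False)
        self.registrations = 0

    def register_with_registry(self):
        # Placeholder: a real bridge would contact the registry here.
        self.registrations += 1

    def handle_timeout(self):
        # Called by handle_request() when no request arrives in time.
        # Code that waits for a socket.timeout exception instead never
        # gets a chance to run.
        self.register_with_registry()


server = ReRegisteringServer(("127.0.0.1", 0))  # port 0: let the OS choose
try:
    server.handle_request()  # no client connects, so this times out
finally:
    server.server_close()
print(server.registrations)
```

Putting the periodic work in `handle_timeout` keeps it tied to the same idle period the older exception-based code depended on.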




         Thanks Tom,

         Local testing confirms the fix works and I've just uploaded patched AG 
packages for Ubuntu 10.10 & Slackware 13.1 to their respective repos. Patched 
packages for other Ubuntu & Slackware versions should appear later today.




            The relevant Python report is here:
            http://bugs.python.org/issue742598

            This does leave open the question of why the problem couldn't be 
replicated in test setups using Python 2.6, as more than one of us has done.



         I think there is additional aberrant behaviour under python2.6 in the 
registry itself which masks the issue fixed by the patch. You'll recall that 
with the APAG registry, the original fault wasn't seen i.e. bridges didn't 
disappear. It turns out that bridges aren't being removed at all in this case, 
even after they have been intentionally stopped, which means non-existent 
bridges are still being advertised. They can only be removed from the 
advertised list by restarting the registry. As an example, I had a bridge named 
SLTest2 registered with the APAG registry. I stopped that bridge over an hour 
ago and since then the machine has been rebooted twice while making new AG 
packages for different distros. Yet that same bridge still appears in the 
bridge list on another machine after a "Purge Bridge Cache". It's disabled and 
unreachable, so doesn't appear in a user's list under the Tools menu, but it's 
clearly still being advertised by the registry.


         chris





            On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:



               Chris,

               I can confirm that LSU is having to run an older version in order
               for our bridge not to disappear from the ANL registry. I haven't
               had time to figure out why it wasn't staying with our FC13
               installation - so I've had to split the bridge and venueserver
               for the moment until I have time to pick it apart...
               I initially suspected it was a python version issue...

               -John Q.
               --
               John I. Quebedeaux, Jr.; Louisiana State University
               Computer Manager LBRN; 131 Life Sciences Bldg.
               e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
               phone: 225-578-0062 / fax: 225-578-2597




                  From: Christoph Willing <c.will...@uq.edu.au>
                  Date: Thu, 14 Oct 2010 21:09:12 +1000
                  To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
                  Cc: marcolino.pi...@ac-paris.fr, ag-t...@mcs.anl.gov
                  Subject: Re: [AG-TECH] Vanishing Bridges


                  On 14/10/2010, at 7:12 AM, Christoph Willing wrote:




                     On 14/10/2010, at 2:13 AM, Thomas Uram wrote:



                        Last week I set up a test registry, registered a bridge
                        with it, and successively queried bridges from the
                        registry all day with no trouble. Granted, these were all
                        local, but if the problem appears as reliably as I've
                        heard, I would have expected to see a problem even in
                        this case. We clearly need to narrow down the cause of
                        the problem some more. What details do we have about the
                        failure cases?




                     We have very few details, unfortunately. I recall, nearly a
                     year ago, I was able to replicate the problem and at that
                     time I thought it may have something to do with newer python
                     versions (since 2.6 was implicated in another problem I'd
                     seen and the replicable cases were on newer systems which
                     included python2.6).

                     However when I was retesting a Debian lenny system (which
                     uses python2.5) just the night before last, I also ran a test
                     with the new Ubuntu maverick (with python2.6). Both ran fine
                     overnight i.e. maverick seems OK despite using python2.6
                     (however note that other tests in France were not successful
                     with maverick, so ....). Anyway, since maverick had run OK
                     for me, I then started a test with Ubuntu lucid (also
                     python2.6), one of the systems with which I'd previously
                     been able to replicate the problem. This time it has run
                     overnight without any bridge disappearances - I just tried a
                     bridge cache purge from home and it showed up fine (still
                     showing up as "LucidTest" in the bridge list if the
                     www.ap-accessgrid.org registry is enabled).




                  On re-reading this last line, I wondered if the problem has
                  something to do with the registry itself. I guess all the
                  failure instances so far have been using the default ANL
                  registryUrl at www.accessgrid.org/registry/peers.txt, whereas
                  my tests the last few days, which produced no failures, all
                  used the APAG registryUrl at
                  www.ap-accessgrid.org/registry/peers.txt. Obviously each points
                  to a different registry so could that be the problem?

                  I spent all day today testing different _recent_ distros
                  (Slackware 13.1, Ubuntu lucid & maverick) against the different
                  registries. In all cases, bridges running against the ANL
                  registry disappeared within 10-15 minutes. In all cases except
                  one (not repeatable), bridges running against the APAG registry
                  did not disappear.

                  My theory therefore is that the ANL registry is running with an
                  older version of the AG toolkit that is not compatible with
                  VenueClients running newer AG versions. Tom's recent testing
                  with a separate test registry supports this theory (assuming
                  the test registry is running a recent version of AG toolkit).
                  Philippe's comment that tests with maverick were unsuccessful
                  also supports the theory (assuming those tests used the default
                  ANL registry).


                  Philippe and Tom (and anyone else interested),

                  Could you try running (using the current AG release) a bridge
                  against the APAG registry - some command like:
                  Bridge3.py --name=Testing123 --location=wherever
                  --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt

                  Leave it running for about an hour or two to confirm it does
                  not disappear. Then stop it and run it again, this time against
                  the ANL registry with something like:
                  Bridge3.py --name=TestingXYZ --location=wherever
                  --registryUrl=http://www.accessgrid.org/registry/peers.txt

                  Look for failure in the first 15 minutes.
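If it helps to avoid retyping, the two runs above can be assembled from a small helper. This is only a convenience sketch: `REGISTRIES` and `bridge_command` are hypothetical names, and the options are exactly the ones shown in the commands above (`--name`, `--location`, `--registryUrl`), nothing more.

```python
# Registry URLs as quoted in this thread.
REGISTRIES = {
    "APAG": "http://www.ap-accessgrid.org/registry/peers.txt",
    "ANL": "http://www.accessgrid.org/registry/peers.txt",
}


def bridge_command(name, location, registry):
    """Build the Bridge3.py argument list for one test run."""
    return [
        "Bridge3.py",
        "--name=" + name,
        "--location=" + location,
        "--registryUrl=" + REGISTRIES[registry],
    ]


apag_run = bridge_command("Testing123", "wherever", "APAG")
anl_run = bridge_command("TestingXYZ", "wherever", "ANL")
print(" ".join(apag_run))
print(" ".join(anl_run))
```

Each list can be pasted as a command line or handed to `subprocess.Popen` if the runs are scripted.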


                  If the fault is in the ANL registry, why do so many bridges
                  _not_ disappear? Looking at the list of bridges, the names are
                  becoming very familiar i.e. they've been around a long time.
                  I'm guessing that these bridges are running on older versions
                  of the AG toolkit - still compatible with whatever version is
                  running on the ANL registry machine.


                  Of course, if the test results are in line with the theory, it
                  still doesn't explain the underlying cause. A quick look
                  through bridge & registry related AG code doesn't reveal any
                  recent changes so the real cause may actually be down in some
                  of the supporting software (python, m2crypto anyone?) which are
                  constantly updated in each new Linux release (typically every
                  6 months). If so, this issue will eventually also bite Windows
                  & Mac users as new OS versions introduce up to date versions of
                  python, m2crypto etc. for them too.


                  chris




                     So we know very little about failure cases;
                     - there are many in France
                     - I was previously able to replicate but not now
                     - I _think_ I recall that Todd Z reported that he had seen
                       the problem too

                     chris




                        On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:



                           Hello,

                           I was not there yesterday and it's probably too late to
                           "purge the cache" (there's just a Lucid test by now).

                           In the meantime we decided to switch to Debian because
                           we have a seminar that will be transmitted tomorrow and
                           really need the bridge to work (in fact, to be visible
                           to new users, and there it is).

                           If it also works with "maverick", that is good news for
                           other users in France (but in the first test we made,
                           the bridge disappeared too...)

                           Thanks for everything!!


                           Philippe d'Anfray




                           On 12/10/2010 12:56, Christoph Willing wrote:




                                    We're still stuck with this bridge problem; we
                                    tried with Ubuntu 10.10 this afternoon but it
                                    is still the same. If you can confirm that it
                                    works fine with Debian, I'll reconfigure our
                                    server and install Debian.




                                 I'm just about to leave for a short holiday so I
                                 can't reconfirm that Debian still works correctly
                                 until late next week.



                              I'm now running a test bridge with Debian "lenny".
                              It has been running nearly 5 hours without any
                              problem so far. I'm also running another test bridge
                              using the new Ubuntu "maverick", which has been
                              running for over 4.5 hours - also no problem yet.
                              I will let them both run overnight here (your day
                              time) and you can check whether they're still
                              running OK if you purge your bridge cache (assuming
                              you have www.ap-accessgrid.org as one of your bridge
                              registries) and look for the bridges named DebTest
                              (Debian lenny 64bit) and MaverickTest (Ubuntu
                              maverick 32bit).




                     Christoph Willing                       +61 7 3365 8316
                     QCIF Access Grid Manager
                     University of Queensland


















