This has been fixed. I replicated the problem with a Bridge running on Ubuntu 
Lucid, registered against the ANL bridge registry.

This problem came down to a change in the request handling code in Python 2.6. 
The change added a handle_timeout method to SocketServer.BaseServer, which gets 
called instead of raising a socket.timeout exception. The bridge code was 
relying on this timeout exception to re-register with the registry. That 
functionality has now been moved to the handle_timeout method.
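
For anyone who wants to see the behaviour in isolation, here is a minimal sketch of the pattern, not the actual AG bridge code. The class name, the register() method and the serve_idle() helper are all hypothetical, and the module is spelled socketserver as in Python 3 (it is SocketServer in Python 2.6). It only shows that handle_request() calls handle_timeout() when no request arrives within the timeout, which is where the re-registration now has to live.

```python
import socketserver  # this module is named SocketServer in Python 2.x


class ReRegisteringServer(socketserver.TCPServer):
    """Hypothetical bridge-like server that re-registers whenever it is idle."""

    timeout = 0.1  # seconds handle_request() waits for a request

    def __init__(self, address, handler):
        socketserver.TCPServer.__init__(self, address, handler)
        self.registrations = 0

    def register(self):
        # Stand-in for contacting the registry (assumption: the real call
        # is not shown in this thread, so we only count invocations).
        self.registrations += 1

    def handle_timeout(self):
        # Called by handle_request() when the timeout expires with no
        # incoming request; no socket.timeout exception is raised.
        self.register()


def serve_idle(rounds):
    """Wait for a request `rounds` times with no clients connecting."""
    server = ReRegisteringServer(("127.0.0.1", 0),
                                 socketserver.BaseRequestHandler)
    try:
        for _ in range(rounds):
            server.handle_request()
    finally:
        server.server_close()
    return server.registrations
```

With no clients connecting, each handle_request() call should time out and trigger one register() call. Code that instead wraps handle_request() in a try/except for socket.timeout would never see the exception, which matches the symptom of a bridge that stops re-registering.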

The change has been committed to the AG code here:
https://trac.ci.uchicago.edu/accessgrid/changeset/6820

The relevant Python report is here:
http://bugs.python.org/issue742598

This does leave open the question of why the problem couldn't be replicated in 
test setups using Python 2.6, as more than one of us has done.

Tom


On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:


   Chris,

   I can confirm that LSU is having to run an older version in order for our
   bridge not to disappear from the ANL registry. I haven't had time to figure
   out why it wasn't staying with our FC13 installation - so I've had to split
   the bridge and venueserver for the moment until I have time to pick it apart...
   I initially suspected it was a python version issue...

   -John Q.
   --
   John I. Quebedeaux, Jr.; Louisiana State University
   Computer Manager LBRN; 131 Life Sciences Bldg.
   e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
   phone: 225-578-0062 / fax: 225-578-2597




      From: Christoph Willing <c.will...@uq.edu.au>
      Date: Thu, 14 Oct 2010 21:09:12 +1000
      To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
      Cc: <marcolino.pi...@ac-paris.fr>, "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
      Subject: Re: [AG-TECH] Vanishing Bridges




      On 14/10/2010, at 7:12 AM, Christoph Willing wrote:




         On 14/10/2010, at 2:13 AM, Thomas Uram wrote:



            Last week I set up a test registry, registered a bridge with it,
            and successively queried bridges from the registry all day with no
            trouble. Granted, these were all local, but if the problem appears
            as reliably as I've heard, I would have expected to see a problem
            even in this case. We clearly need to narrow down the cause of the
            problem some more. What details do we have about the failure cases?




         We have very few details, unfortunately. I recall, nearly a year
         ago, I was able to replicate the problem and at that time I thought
         it may have something to do with newer python versions (since 2.6
         was implicated in another problem I'd seen and the replicable cases
         were on newer systems which included python2.6).



         However, when I was retesting a Debian lenny system (which uses
         python2.5) just the night before last, I also ran a test with the new
         Ubuntu maverick (with python2.6). Both ran fine overnight, i.e.
         maverick seems OK despite using python2.6 (however note that other
         tests in France were not successful with maverick, so ....). Anyway,
         since maverick had run OK for me, I then started a test with Ubuntu
         lucid (also python2.6), one of the systems with which I'd previously
         been able to replicate the problem. This time it has run overnight
         without any bridge disappearances - I just tried a bridge cache
         purge from home and it showed up fine (still showing up as
         "LucidTest" in the bridge list if the www.ap-accessgrid.org
         registry is enabled).




      On re-reading this last line, I wondered if the problem has something
      to do with the registry itself. I guess all the failure instances so
      far have been using the default ANL registryUrl at
      www.accessgrid.org/registry/peers.txt, whereas my tests the last few
      days, which produced no failures, all used the APAG registryUrl at
      www.ap-accessgrid.org/registry/peers.txt.
      Obviously each points to a different registry so could that be the
      problem?



      I spent all day today testing different _recent_ distros (Slackware
      13.1, Ubuntu lucid & maverick) against the different registries. In
      all cases, bridges running against the ANL registry disappeared within
      10-15 minutes. In all cases except one (not repeatable), bridges
      running against the APAG registry did not disappear.



      My theory therefore is that ANL registry is running with an older
      version of the AG toolkit that is not compatible with VenueClients
      running newer AG versions. Tom's recent testing with a separate test
      registry supports this theory (assuming the test registry is running a
      recent version of AG toolkit). Philippe's comment that tests with
      maverick were unsuccessful also supports the theory (assuming those
      tests used the default ANL registry).




      Philippe and Tom (and anyone else interested),

      Could you try running (using the current AG release) a bridge against
      the APAG registry - some command like:

        Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt

      Leave it running for about an hour or two to confirm it does not
      disappear. Then stop it and run it again, this time against the ANL
      registry with something like:

        Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt

      Look for failure in the first 15 minutes.




      If the fault is in the ANL registry, why do so many bridges _not_
      disappear? Looking at the list of bridges, the names are becoming very
      familiar i.e. they've been around a long time. I'm guessing that these
      bridges are running on older versions of the AG toolkit - still
      compatible with whatever version is running on the ANL registry machine.




      Of course, if the test results are in line with the theory, it still
      doesn't explain the underlying cause. A quick look through bridge &
      registry related AG code doesn't reveal any recent changes so the real
      cause may actually be down in some of the supporting software (python,
      m2crypto anyone?) which are constantly updated in each new Linux
      release (typically every 6 months). If so, this issue will eventually
      also bite Windows & Mac users as new OS versions introduce up to date
      versions of python, m2crypto etc. for them too.




      chris




         So we know very little about failure cases:
         - there are many in France
         - I was previously able to replicate but not now
         - I _think_ I recall that Todd Z reported that he had seen the
           problem too



         chris




            On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:



               Hello,

               I was not there yesterday and it's probably too late to "purge the
               cache" (there's just a Lucid test by now)

               By the time we decided to switch to debian because we have a
               seminar that will be transmitted tomorrow and really need the
               bridge to work (in fact to be visible to new users and there it is).

               If it works also with "maverick" it is good news for other users
               in France (but in the first test we made the bridge disappeared too...)

               Thanks for everything!!




               Philippe d'Anfray






               On 12/10/2010 12:56, Christoph Willing wrote:




                        We're still stuck with this bridge problem; we tried with
                        Ubuntu 10.10 this afternoon but it is still the same. If you
                        can confirm to us that it works fine with Debian, I'll
                        reconfigure our server and install Debian.




                     I'm just about to leave for a short holiday so I can't reconfirm
                     that Debian still works correctly until late next week.



                  I'm now running a test bridge with Debian "lenny". It has been
                  running nearly 5 hours without any problem so far. I'm also
                  running another test bridge using the new Ubuntu "maverick",
                  which has been running for over 4.5 hours - also no problem yet.
                  I will let them both run overnight here (your day time) and you
                  can check whether they're still running OK if you purge your
                  bridge cache (assuming you have www.ap-accessgrid.org as one of
                  your bridge registries) and look for the bridges named DebTest
                  (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).







         Christoph Willing                       +61 7 3365 8316
         QCIF Access Grid Manager
         University of Queensland




      Christoph Willing                       +61 7 3365 8316
      QCIF Access Grid Manager
      University of Queensland




