On 15/10/2010, at 1:23 PM, Thomas Uram wrote:

> This fix also addresses the problem where bridges are not removed
> from the registry. The RegistryPeer also uses the AGXMLRPCServer,
> and relies on the timeout for cleaning up bridges that have timed
> out. I haven't confirmed the fix in this case by testing, but it's
> clearly borne out in the code. I'll test it tomorrow.
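
A rough sketch of the kind of cleanup being described, under assumptions: the registry remembers when each bridge last registered and drops entries older than some maximum age whenever its server's timeout hook fires. BridgeTable, max_age and purge_expired are hypothetical names for illustration only, not the RegistryPeer or AGXMLRPCServer API.

```python
import time


class BridgeTable:
    """Hypothetical registry bookkeeping: bridge name -> last registration time."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age    # seconds a registration stays valid (assumed)
        self.clock = clock        # injectable clock, which makes testing easy
        self.last_seen = {}

    def register(self, name):
        # Called each time a bridge registers or re-registers.
        self.last_seen[name] = self.clock()

    def purge_expired(self):
        # Drop every bridge whose last registration is older than max_age
        # and return the names that were removed.
        now = self.clock()
        expired = [n for n, t in self.last_seen.items() if now - t > self.max_age]
        for name in expired:
            del self.last_seen[name]
        return sorted(expired)

    def advertised(self):
        # Names the registry would still hand out to clients.
        return sorted(self.last_seen)
```

In a server built on SocketServer, purge_expired() is the sort of call that would be made from the timeout path. If that path is never reached, stopped bridges stay in the advertised list indefinitely.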


Tom,

I just updated our registry machine and bridges are now being removed
correctly. Looks like a good fix all round.

I can't help thinking there are other bits of AG code that would
benefit from a similar fix - the ftps server springs to mind (can't
currently upload data to a venue on a server running with python2.6).


chris


> On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:
>
>>
>> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
>>
>>> This has been fixed. I replicated the problem with a Bridge
>>> running on Ubuntu Lucid, registered against the ANL bridge registry.
>>>
>>> This problem came down to a change in the request handling code in
>>> Python 2.6. The change added a handle_timeout method to
>>> SocketServer.BaseServer, which gets called instead of raising a
>>> socket.timeout exception. The bridge code was relying on this
>>> timeout exception to re-register with the registry. That
>>> functionality has now been moved to the handle_timeout method.
>>>
>>> The change has been committed to the AG code here:
>>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
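
A minimal sketch of the pattern described above, written for Python 3, where the module is named socketserver (it was SocketServer in 2.6). The class name ReRegisteringServer and the on_timeout callback are hypothetical and are not taken from the AG code or from changeset 6820. The sketch only shows that an idle handle_request() ends in handle_timeout() instead of an exception.

```python
import socketserver


class ReRegisteringServer(socketserver.TCPServer):
    """TCP server that runs a callback whenever handle_request() times out."""

    allow_reuse_address = True
    timeout = 0.2  # seconds handle_request() waits for a connection

    def __init__(self, address, handler_class, on_timeout):
        super().__init__(address, handler_class)
        self.on_timeout = on_timeout  # hypothetical re-registration hook

    def handle_timeout(self):
        # Since Python 2.6, handle_request() calls this method when no
        # request arrives in time; it does not raise socket.timeout.
        self.on_timeout()


def serve_one_cycle(server):
    """Wait for at most one request; an idle wait ends in handle_timeout()."""
    server.handle_request()
```

Calling serve_one_cycle() in a loop gives the periodic behaviour the bridge needs: each idle interval triggers the callback, which is where re-registration with the registry would go.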
>>
>>
>> Thanks Tom,
>>
>> Local testing confirms the fix works and I've just uploaded patched
>> AG packages for Ubuntu 10.10 & Slackware 13.1 to their respective
>> repos. Patched packages for other Ubuntu & Slackware versions
>> should appear later today.
>>
>>
>>> The relevant Python report is here:
>>> http://bugs.python.org/issue742598
>>>
>>> This does leave open the question of why the problem couldn't be
>>> replicated in test setups using Python 2.6, as more than one of us
>>> has done.
>>
>> I think there is additional aberrant behaviour under python2.6 in
>> the registry itself which masks the issue fixed by the patch.
>> You'll recall that with the APAG registry, the original fault
>> wasn't seen i.e. bridges didn't disappear. It turns out that
>> bridges aren't being removed at all in this case, even after they
>> have been intentionally stopped, which means non-existent bridges
>> are still being advertised. They can only be removed from the
>> advertised list by restarting the registry. As an example, I had a
>> bridge named SLTest2 registered with the APAG registry. I stopped
>> that bridge over an hour ago and since then the machine has been
>> rebooted twice while making new AG packages for different distros.
>> Yet that same bridge still appears in the bridge list on another
>> machine after a "Purge Bridge Cache". It's disabled and unreachable,
>> so doesn't appear in a user's list under the Tools menu, but it's
>> clearly still being advertised by the registry.
>>
>>
>> chris
>>
>>
>>
>>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>>>
>>>> Chris,
>>>>
>>>> I can confirm that LSU is having to run an older version in order
>>>> for our bridge not to disappear from the ANL registry. I haven't had
>>>> time to figure out why it wasn't staying with our FC13 installation -
>>>> so I've had to split the bridge and venueserver for the moment until
>>>> I have time to pick it apart...
>>>> I initially suspected it was a python version issue...
>>>>
>>>> -John Q.
>>>> --
>>>> John I. Quebedeaux, Jr.; Louisiana State University
>>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>>> e-mail: jo...@lsu.edu; web: http://lbrn.lsu.edu
>>>> phone: 225-578-0062 / fax: 225-578-2597
>>>>
>>>>
>>>>> From: Christoph Willing <c.will...@uq.edu.au>
>>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>>> To: Philippe d'Anfray <philippe.d-anf...@cea.fr>
>>>>> Cc: "<marcolino.pi...@ac-paris.fr>" <marcolino.pi...@ac-paris.fr>,
>>>>> "ag-t...@mcs.anl.gov" <ag-t...@mcs.anl.gov>
>>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>>>
>>>>>
>>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>>>
>>>>>>
>>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>>>
>>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>>> and successively queried bridges from the registry all day with
>>>>>>> no trouble. Granted, these were all local, but if the problem
>>>>>>> appears as reliably as I've heard, I would have expected to see a
>>>>>>> problem even in this case. We clearly need to narrow down the
>>>>>>> cause of the problem some more. What details do we have about the
>>>>>>> failure cases?
>>>>>>
>>>>>>
>>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>>> ago, I was able to replicate the problem and at that time I thought
>>>>>> it may have something to do with newer python versions (since 2.6
>>>>>> was implicated in another problem I'd seen and the replicable cases
>>>>>> were on newer systems which included python2.6).
>>>>>>
>>>>>> However when I was retesting a Debian lenny system (which uses
>>>>>> python2.5) just the night before last, I also ran a test with the
>>>>>> new Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
>>>>>> maverick seems OK despite using python2.6 (however note that other
>>>>>> tests in France were not successful with maverick, so ....).
>>>>>> Anyway, since maverick had run OK for me, I then started a test
>>>>>> with Ubuntu lucid (also python2.6), one of the systems with which
>>>>>> I'd previously been able to replicate the problem. This time it has
>>>>>> run overnight without any bridge disappearances - I just tried a
>>>>>> bridge cache purge from home and it showed up fine (still showing
>>>>>> up as "LucidTest" in the bridge list if the www.ap-accessgrid.org
>>>>>> registry is enabled).
>>>>>
>>>>>
>>>>> On re-reading this last line, I wondered if the problem has
>>>>> something to do with the registry itself. I guess all the failure
>>>>> instances so far have been using the default ANL registryUrl at
>>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
>>>>> days, which produced no failures, all used the APAG registryUrl at
>>>>> www.ap-accessgrid.org/registry/peers.txt. Obviously each points to a
>>>>> different registry so could that be the problem?
>>>>>
>>>>> I spent all day today testing different _recent_ distros (Slackware
>>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In
>>>>> all cases, bridges running against the ANL registry disappeared
>>>>> within 10-15 minutes. In all cases except one (not repeatable),
>>>>> bridges running against the APAG registry did not disappear.
>>>>>
>>>>> My theory therefore is that the ANL registry is running with an
>>>>> older version of the AG toolkit that is not compatible with
>>>>> VenueClients running newer AG versions. Tom's recent testing with a
>>>>> separate test registry supports this theory (assuming the test
>>>>> registry is running a recent version of the AG toolkit). Philippe's
>>>>> comment that tests with maverick were unsuccessful also supports the
>>>>> theory (assuming those tests used the default ANL registry).
>>>>>
>>>>>
>>>>> Philippe and Tom (and anyone else interested),
>>>>>
>>>>> Could you try running (using the current AG release) a bridge
>>>>> against the APAG registry - some command like:
>>>>>  Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>>>
>>>>> Leave it running for about an hour or two to confirm it does not
>>>>> disappear. Then stop it and run it again, this time against the ANL
>>>>> registry with something like:
>>>>>  Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>>>
>>>>> Look for failure in the first 15 minutes.
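
The two-step test above could be scripted along these lines. This is only a sketch: it builds the argument lists from the flags and registry URLs quoted in this message, and the helper name bridge_command is hypothetical. Launching the bridge and watching for its disappearance is left to the reader.

```python
# Registry URLs as quoted in the message above.
REGISTRIES = {
    "APAG": "http://www.ap-accessgrid.org/registry/peers.txt",
    "ANL": "http://www.accessgrid.org/registry/peers.txt",
}


def bridge_command(name, registry, location="wherever"):
    """Return the Bridge3.py argument list for one test run (hypothetical helper)."""
    return [
        "Bridge3.py",
        "--name=" + name,
        "--location=" + location,
        "--registryUrl=" + REGISTRIES[registry],
    ]
```

Passing the returned list to subprocess.Popen would start the bridge, assuming Bridge3.py is executable and on the PATH.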
>>>>>
>>>>>
>>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>>> disappear? Looking at the list of bridges, the names are becoming
>>>>> very familiar i.e. they've been around a long time. I'm guessing
>>>>> that these bridges are running on older versions of the AG toolkit -
>>>>> still compatible with whatever version is running on the ANL
>>>>> registry machine.
>>>>>
>>>>>
>>>>> Of course, if the test results are in line with the theory, it still
>>>>> doesn't explain the underlying cause. A quick look through bridge &
>>>>> registry related AG code doesn't reveal any recent changes so the
>>>>> real cause may actually be down in some of the supporting software
>>>>> (python, m2crypto anyone?) which is constantly updated in each new
>>>>> Linux release (typically every 6 months). If so, this issue will
>>>>> eventually also bite Windows & Mac users as new OS versions
>>>>> introduce up to date versions of python, m2crypto etc. for them too.
>>>>>
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>>> So we know very little about failure cases;
>>>>>> - there are many in France
>>>>>> - I was previously able to replicate but not now
>>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>>> problem too
>>>>>>
>>>>>> chris
>>>>>>
>>>>>>
>>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I was not there yesterday and it's probably too late to "purge
>>>>>>>> the cache" (there's just a Lucid test by now).
>>>>>>>>
>>>>>>>> In the meantime we decided to switch to Debian because we have a
>>>>>>>> seminar that will be transmitted tomorrow and really need the
>>>>>>>> bridge to work (in fact to be visible to new users, and there it
>>>>>>>> is).
>>>>>>>>
>>>>>>>> If it also works with "maverick" that is good news for other
>>>>>>>> users in France (but in the first test we made the bridge
>>>>>>>> disappeared too...)
>>>>>>>>
>>>>>>>> Thanks for everything!!
>>>>>>>>
>>>>>>>>
>>>>>>>> Philippe d'Anfray
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We're still stuck with this bridge problem. We tried with
>>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>>> can confirm for us that it works fine with Debian, I'll
>>>>>>>>>>> reconfigure our server and install Debian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>>>>>> that Debian still works correctly until late next week.
>>>>>>>>>
>>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>>>>>> I will let them both run overnight here (your day time) and you
>>>>>>>>> can check whether they're still running OK if you purge your
>>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>>>>>> your bridge registries) and look for the bridges named DebTest
>>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Christoph Willing                       +61 7 3365 8316
>>>>>> QCIF Access Grid Manager
>>>>>> University of Queensland
>>>>>>

Christoph Willing                       +61 7 3365 8316
QCIF Access Grid Manager
University of Queensland
