Hi Martin, I'm really sorry, it seems I missed your mail! Thank you for answering and for pointing me to it. I'll try to answer the questions.
Martin Feller wrote, on 06.02.2009 17:50:
> Kay,
>
> Just to make sure: You upgraded to 4.0.8 AND applied the patch, right?
> (Because 4.0.8 does not include that fix)

Yes, we have upgraded to 4.0.8 and applied the patch. The patch helped a
bit: since we applied it, the container runs for almost a day under normal
load, and then the CPU load suddenly jumps up to 100% and sometimes more.
I don't know if it is important, but we are running SGE as the cluster
scheduler.

> Hm, a slow container, many factors ...
> Some things that might help narrowing down the issue:
>
> * What exactly is slow, and in what situations: heavy container-load,
>   low load, no load?

As far as I can tell, it appears independently of the container load
within a day. We submit (with Gridway from another node) about 30 jobs
every 6 hours, with around 30 MB of data each.

> * Is the persistence directory on a shared filesystem (e.g. NFS)?

Yes, the homes are on an NFS share.

> * What are the container thread settings in
>   $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd?
>   We experienced that, with default configuration, there were in some
>   cases only 2 threads that handled incoming requests.
>   (check
>   http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Performance_Guide.html)

Ok, thanks for the hint. We are running Globus with the default
configuration. We will try changing it.

> * Are job deletions involved when it's getting slow?

No, not as far as I can say. Globus itself doesn't seem to be slow; it's
just consuming 100% of the CPU.
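As a side note on the thread settings mentioned above: the fragment below is only a sketch of what the change might look like in server-config.wsdd. The parameter names and values are assumptions based on the WS_GRAM Performance Guide linked above, not verified against this installation; check your own file before editing.

```xml
<!-- Sketch only: fragment of
     $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd.
     Parameter names/values are assumptions taken from the WS_GRAM
     Performance Guide; verify against your installation. -->
<globalConfiguration>
  <!-- threads started when the container boots -->
  <parameter name="containerThreads" value="10"/>
  <!-- upper bound on request-handling threads -->
  <parameter name="containerThreadsMax" value="40"/>
</globalConfiguration>
```

The container has to be restarted for changes to this file to take effect.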
Just some measurements during the high CPU consumption:

wsrf-query -x:        real 0m10.420s  user 0m6.404s  sys 0m3.376s
globus-url-copy:      real 0m0.627s   user 0m0.176s  sys 0m0.260s
globusrun-ws ls -la:  real 1m14.863s  user 0m1.580s  sys 0m2.256s

After restarting Globus:

wsrf-query -x:        real 0m7.010s   user 0m4.780s  sys 0m1.460s
globus-url-copy:      real 0m0.549s   user 0m0.188s  sys 0m0.160s
globusrun-ws ls -la:  real 1m12.050s  user 0m1.616s  sys 0m2.156s

> * Do you have a lot of persisted jobs/credentials/subscriptions in the
>   persistence directory?
>   (by default ~<containeruser>/.globus/persisted/)

Around 400 in ManagedExecutableJobResourceStateType, around 850 in
PersistentSubscription, around 100 in DelegationResource, and almost none
in ReliableFileTransferResource. We removed them as suggested in one of
the previous mails, but it didn't help.

> * In a situation where the container is slow: A JVM thread dump might
>   give us some insight to see what the threads are actually doing.
>   (kill -QUIT <container-pid>)

I'll send it in the next 24 hours; we restarted Globus today...

> So, if you are willing to debug we could try to track some things.

Of course we are! ;-)

> Also: Is there a way for you to try 4.2.1? I'd argue that Gram4 has far
> less potential in 4.2 to be the resource-hog in the container. I can
> point you to documentation that describes how to run Gram4 with a
> container with at most 200M usage and being scalable nonetheless.
> (Will be the default in 4.2.2)

At the moment 4.2 is not an option, because we have several services
which were developed for 4.0.

Thanks for helping!

Cheers,
Kay

> -Martin
>
> Kay Dörnemann wrote:
>> Hi,
>>
>> first of all I want to thank you. We had the same problem, and primarily
>> your suggested fix helped, but after approximately 24h the container
>> process pegged 100% of the CPU again, until we restarted it. It is fully
>> functional, but the system is slow.
>> The only error message we found in the logs is:
>>
>> 2009-01-25 22:01:29,820 ERROR container.ServiceThread
>>   [ServiceThread-5370,run:297] Unexpected error during request processing
>> java.lang.NullPointerException
>>   at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
>>   at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
>> 2009-01-25 22:01:33,139 ERROR container.ServiceThread
>>   [ServiceThread-5353,run:297] Unexpected error during request processing
>> java.lang.NullPointerException
>>   at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
>>   at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
>>
>> We tried to upgrade from 4.0.5 to 4.0.8, but the problem still persists.
>>
>> Any ideas?
>>
>> Cheers,
>>
>> Kay
>>
>> Patrick Armstrong wrote, on 16.01.2009 23:05:
>>> On 16-Jan-09, at 1:10 PM, Martin Feller wrote:
>>>> This sounds as if you are hitting
>>>> http://bugzilla.globus.org/globus/show_bug.cgi?id=6341
>>>
>>> Yep! This was exactly the issue, as far as I can see. I patched the
>>> system as described in the bug, and I get a nice container error now!
>>> Much better than an infinite loop.
>>>
>>> I made a patch, and if anyone is running into this issue, you can patch
>>> your system by running the following on your globus server:
>>>
>>> wget -qO- https://particle.phys.uvic.ca/~patricka/globus-gram-local-proxy-tool.patch \
>>>   | patch $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool
>>>
>>> --patrick

--
Dipl.-Inform. Kay Dörnemann
Distributed Systems Group | Information Systems Institute
University of Marburg, Germany | University of Siegen, Germany
Phone: +49-6421-28-21563 | +49-271-740-4075
Fax:   +49-6421-28-21573 | +49-271-740-2372
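P.S.: For anyone following along, the thread-dump capture suggested earlier in the thread can be scripted as below. This is only a sketch: the container main-class pattern is an assumption (check `ps` output on your host), and it assumes the container's stdout/stderr are redirected to a log file, which is where the dump will appear.

```shell
#!/bin/sh
# Sketch: ask a running JVM (e.g. the Globus container) for a thread dump.
# SIGQUIT makes a HotSpot JVM print a full thread dump to its own stdout
# and keep running; the dump does NOT appear in the shell that sends it.

request_thread_dump() {
    # $1: pattern matching the target JVM's command line
    pid=$(pgrep -f "$1" | head -n 1)
    if [ -z "$pid" ]; then
        echo "no process matching '$1'" >&2
        return 1
    fi
    kill -QUIT "$pid"
    echo "$pid"
}

# Typical call (main-class name is an assumption, verify with ps):
#   request_thread_dump org.globus.wsrf.container.ServiceContainer
# then search the container log for "Full thread dump".
```

Sending SIGQUIT is non-destructive for a JVM, so it can be done repeatedly while the container is pegged to compare what the threads are doing.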
