Hi Martin, I'm really sorry, it seems I missed your mail! Thank you for answering and for pointing me to it. I'll try to answer the questions.
Martin Feller wrote, on 06.02.2009 17:50:
> Kay,
>
> Just to make sure: You upgraded to 4.0.8 AND applied the patch, right?
> (Because 4.0.8 does not include that fix)

Yes, we have upgraded to 4.0.8 and applied the patch. The patch helped a
bit: since we applied it, the container runs for almost a day under normal
load, and then the CPU load suddenly jumps up to 100% and sometimes more.
I don't know if it is important, but we are running SGE as the cluster
scheduler.

> Hm, a slow container, many factors ...
> Some things that might help narrowing down the issue:
>
> * What exactly is slow, and in what situations: heavy container-load,
>   low load, no load?

As far as I can tell, it appears independently of the container load
within a day. We submit (with Gridway from another node) about 30 jobs
every 6 hours, with around 30 MB of data each.

> * Is the persistence directory on a shared filesystem (e.g. NFS)?

Yes, the homes are on an NFS share.

> * What are the container thread settings in
>   $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd?
>   We experienced that, with default configuration, there were in some
>   cases only 2 threads that handled incoming requests.
>   (check
>   http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Performance_Guide.html)

Ok, thanks for the hint. We are running Globus with the default
configuration. We will try changing it.

> * Are job deletions involved when it's getting slow?

No, not as far as I can say. Globus itself doesn't seem to be slow; it's
just consuming 100% of the CPU.
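As a side note on the thread settings mentioned above: the fragment below is only a sketch of what the change might look like in server-config.wsdd. The parameter names and values are assumptions based on the WS_GRAM Performance Guide linked above, not verified against this installation; check your own file before editing.

```xml
<!-- Sketch only: fragment of
     $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd.
     Parameter names/values are assumptions taken from the WS_GRAM
     Performance Guide; verify against your installation. -->
<globalConfiguration>
  <!-- threads started when the container boots -->
  <parameter name="containerThreads" value="10"/>
  <!-- upper bound on request-handling threads -->
  <parameter name="containerThreadsMax" value="40"/>
</globalConfiguration>
```

The container has to be restarted for changes to this file to take effect.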
Just some measurements during the high CPU consumption:

wsrf-query -x:        real 0m10.420s  user 0m6.404s  sys 0m3.376s
globus-url-copy:      real 0m0.627s   user 0m0.176s  sys 0m0.260s
globusrun-ws ls -la:  real 1m14.863s  user 0m1.580s  sys 0m2.256s

After restarting Globus:

wsrf-query -x:        real 0m7.010s   user 0m4.780s  sys 0m1.460s
globus-url-copy:      real 0m0.549s   user 0m0.188s  sys 0m0.160s
globusrun-ws ls -la:  real 1m12.050s  user 0m1.616s  sys 0m2.156s

> * Do you have a lot of persisted jobs/credentials/subscriptions in the
>   persistence directory?
>   (by default ~<containeruser>/.globus/persisted/)

Around 400 in ManagedExecutableJobResourceStateType, around 850 in
PersistentSubscription, around 100 in DelegationResource, and almost none
in ReliableFileTransferResource. We removed them as suggested in one of
the previous mails, but it didn't help.

> * In a situation where the container is slow: A JVM thread dump might
>   give us some insight to see what the threads are actually doing.
>   (kill -QUIT <container-pid>)

I'll send it in the next 24 hours; we restarted Globus today...

> So, if you are willing to debug we could try to track some things.

Of course we are! ;-)

> Also: Is there a way for you to try 4.2.1? I'd argue that Gram4 has far
> less potential in 4.2 to be the resource-hog in the container. I can
> point you to documentation that describes how to run Gram4 with a
> container with at most 200M usage and being scalable nonetheless.
> (Will be the default in 4.2.2)

At the moment 4.2 is not an option, because we have several services
which were developed for 4.0.

Thanks for helping!

Cheers,
Kay

> -Martin
>
> Kay Dörnemann wrote:
>> Hi,
>>
>> first of all I want to thank you. We had the same problem, and primarily
>> your suggested fix helped, but after approximately 24h the container
>> process pegged 100% of the CPU again, until we restarted it. It is fully
>> functional, but the system is slow.
>> The only error message we found in the logs is:
>>
>> 2009-01-25 22:01:29,820 ERROR container.ServiceThread
>>   [ServiceThread-5370,run:297] Unexpected error during request processing
>> java.lang.NullPointerException
>>   at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
>>   at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
>> 2009-01-25 22:01:33,139 ERROR container.ServiceThread
>>   [ServiceThread-5353,run:297] Unexpected error during request processing
>> java.lang.NullPointerException
>>   at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
>>   at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
>>
>> We tried to upgrade from 4.0.5 to 4.0.8, but the problem still persists.
>>
>> Any ideas?
>>
>> Cheers,
>>
>> Kay
>>
>> Patrick Armstrong wrote, on 16.01.2009 23:05:
>>> On 16-Jan-09, at 1:10 PM, Martin Feller wrote:
>>>> This sounds as if you are hitting
>>>> http://bugzilla.globus.org/globus/show_bug.cgi?id=6341
>>>
>>> Yep! This was exactly the issue, as far as I can see. I patched the
>>> system as described in the bug, and I get a nice container error now!
>>> Much better than an infinite loop.
>>>
>>> I made a patch, and if anyone is running into this issue, you can patch
>>> your system by running the following on your globus server:
>>>
>>> wget -qO- https://particle.phys.uvic.ca/~patricka/globus-gram-local-proxy-tool.patch \
>>>   | patch $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool
>>>
>>> --patrick

--
Dipl.-Inform. Kay Dörnemann
Distributed Systems Group | Information Systems Institute
University of Marburg, Germany | University of Siegen, Germany
Phone: +49-6421-28-21563 | +49-271-740-4075
Fax:   +49-6421-28-21573 | +49-271-740-2372
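P.S.: For anyone following along, the thread-dump capture suggested earlier in the thread can be scripted as below. This is only a sketch: the container main-class pattern is an assumption (check `ps` output on your host), and it assumes the container's stdout/stderr are redirected to a log file, which is where the dump will appear.

```shell
#!/bin/sh
# Sketch: ask a running JVM (e.g. the Globus container) for a thread dump.
# SIGQUIT makes a HotSpot JVM print a full thread dump to its own stdout
# and keep running; the dump does NOT appear in the shell that sends it.

request_thread_dump() {
    # $1: pattern matching the target JVM's command line
    pid=$(pgrep -f "$1" | head -n 1)
    if [ -z "$pid" ]; then
        echo "no process matching '$1'" >&2
        return 1
    fi
    kill -QUIT "$pid"
    echo "$pid"
}

# Typical call (main-class name is an assumption, verify with ps):
#   request_thread_dump org.globus.wsrf.container.ServiceContainer
# then search the container log for "Full thread dump".
```

Sending SIGQUIT is non-destructive for a JVM, so it can be done repeatedly while the container is pegged to compare what the threads are doing.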
