Christoph:
Some questions:
- what happens if you submit a simple job without streaming:
globusrun-ws -submit -F hydra -c /bin/date
- what happens if you submit a staging job without streaming as
described in
http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html#q-gram2
- back to your job: can you turn on full debug mode in the
container and send the logfile?
($GLOBUS_LOCATION/container-log4j.properties:
log4j.category.org.globus=DEBUG)
If your GT installation is not in production use, please remove
the persistence data in ~/.globus/persisted of the user who runs
the container before you start the container at log level DEBUG.
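
If you would rather keep the old persistence data around than delete it,
something like the following could set it aside first. This is only a
sketch: the function name and the ".bak-<timestamp>" naming are made up
for illustration, and it assumes the container is stopped and that the
script runs as the user who owns ~/.globus/persisted.

```python
import shutil
import time
from pathlib import Path


def set_aside_persisted(persisted_dir):
    """Rename the persistence directory to a timestamped backup.

    Returns the backup path, or None if the directory does not exist.
    The backup naming scheme is an illustrative assumption, not a GT convention.
    """
    src = Path(persisted_dir).expanduser()
    if not src.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = src.with_name(f"{src.name}.bak-{stamp}")
    shutil.move(str(src), str(backup))
    return backup


# Example call (with the container stopped, as the container user):
# set_aside_persisted("~/.globus/persisted")
```

Renaming instead of deleting gives the container a clean start while
still letting you restore the old state afterwards.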
Your problem is quite valuable to us because this "Waiting to be
Done or Failed" issue is critical for us and we have never really been
able to reproduce it with simple, single job submissions
(see http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5247).
Martin
Hello again!
After doing some more research, we noticed that we keep getting the
following debug messages in our container.log:
2007-06-13 13:45:30,009 DEBUG
ManagedExecutableJobResource.5f9ea370-78d9-11db-b934-9189e81827f7
[Thread-3,remove:296] Waiting to be Done or Failed. Current state:
FailureFileCleanUpResponse
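
To see how often this message appears and which states it reports, a
small script along these lines could tally them. It is only a sketch: it
assumes the message format shown in the excerpt above (allowing for line
wrapping), and the function name is invented for illustration.

```python
import re
from collections import Counter

# Assumed message format, taken from the excerpt above.
# \s+ also matches newlines, so wrapped entries are still found.
WAITING = re.compile(
    r"Waiting to be Done or Failed\.\s+Current state:\s+(\w+)"
)


def count_waiting_states(log_text):
    """Return a Counter mapping each reported state to its occurrence count."""
    return Counter(WAITING.findall(log_text))


# Small demonstration on a wrapped entry like the one quoted above.
sample = (
    "2007-06-13 13:45:30,009 DEBUG\n"
    "ManagedExecutableJobResource.5f9ea370-78d9-11db-b934-9189e81827f7\n"
    "[Thread-3,remove:296] Waiting to be Done or Failed. Current state:\n"
    "FailureFileCleanUpResponse\n"
)
for state, n in count_waiting_states(sample).most_common():
    print(n, state)
```

For a real log, the file contents can be read into a string and passed
to the same function.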
I added the debug flag to the command mentioned below (globusrun-ws
-submit -dbg -F hydra -s -c /bin/hostname) and I get messages like this
(until I cancel the job):
...
debug: operation complete
Canceling...debug: starting to get
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stderr
debug: sending command:
ERET P 0 65536
/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stderr
debug: response from
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stderr:
125 Begining transfer; reusing existing data connection.
debug: reading into data buffer 0x812c480, maximum length 65536
debug: data callback, no error, buffer 0x812c480, length 0, offset=0,
eof=true
debug: response from
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stderr:
226 Transfer Complete.
debug: operation complete
debug: starting to get
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stdout
debug: sending command:
ERET P 6 65536
/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stdout
debug: response from
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stdout:
125 Begining transfer; reusing existing data connection.
debug: reading into data buffer 0x812c480, maximum length 65536
debug: data callback, no error, buffer 0x812c480, length 0, offset=6,
eof=true
debug: response from
gsiftp://hydra.gup.uni-linz.ac.at:2811/home/local/agrid/agp11092/dc6d8dec-19a3-11dc-9b58-0002a5e72f21.0.stdout:
226 Transfer Complete.
...
It is definitely not a problem on the client side, because when I execute
exactly the same command (changing only the factory contact to altix1)
on the same host, with the same credentials, the command runs and
terminates as expected.
We are running out of ideas, so if anybody could point us in the right
direction we would be really grateful!
Regards,
Christoph Spielmann
Christoph Spielmann wrote:
Hi everybody!
After about a week of trial-and-error debugging, I decided to have
some experts look at my problem. ;)
First of all, some background: we installed GT 4.0.4 on one of our
clusters, and as far as I can remember globusrun-ws was working when we
first tried it (I asked the colleague who helped me with the
installation, but he wasn't sure himself). After a while we noticed
that some other components of GT weren't working as expected, and we
figured the cause was that we had run gpt-postinstall as root and not
as user globus during the installation, so we ran gpt-postinstall
again as user 'globus'.
Now every time we try to use globusrun-ws it hangs. globus-job-run on
the other hand works perfectly.
Here's the output when I try to run a job with globusrun-ws:
globusrun-ws -submit -F hydra -s -c /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:1f363be0-182a-11dc-8b33-0002a5e72f21
Termination time: 06/12/2007 14:43 GMT
Current job state: Active
hydra <-- it hangs here until I cancel the job.
After cancelling the job it continues and I get this:
Canceling...Canceled.
Destroying job...Done.
Cleaning up any delegated credentials...Done.
globusrun-ws: Operation was canceled
The output of globus-gatekeeper.log:
TIME: Mon Jun 11 16:38:17 2007
PID: 24648 -- Notice: 6: Got connection 140.78.104.101 at Mon Jun 11
16:38:17 2007
Failed reading length 0
GSS authentication failure
globus_gss_assist token :3: read failure: Connection closed
Failure: GSS failed Major:01090000 Minor:00000000 Token:00000003
TIME: Mon Jun 11 16:38:17 2007
PID: 24648 -- Failure: GSS failed Major:01090000 Minor:00000000
Token:00000003
I'll attach the container.log to this mail because it's a rather big file!
Does anybody have a clue what could be wrong here?
Regards,
Christoph Spielmann