Hello all,
I did a little more debugging on this myself. Here is what I did.
1. Submitted a job from GW using gwsubmit.
2. From the Globus logs, I got the Globus job ID for this
job.
3. I created a relevant EPR file for this and used
globusrun-ws -status -job-epr-file <> to check the status of the job.
host:~> gwsubmit pi/pi.jt
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
host:~> gwsubmit pi/pi.jt
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 --
pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 --
pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 --
pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
loniuser:0 0 wrap ---- 15:26:54 --:--:-- 0:00:02 0:00:01 --
pi.jt host/LSF
host:~> vi slice.epr
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Active
host:~> gwkill 0
At this point, the gwkill command basically hangs. I had to background
it and run globusrun-ws -status to get the real status from the
backend, which was:
host:~> globusrun-ws -status -job-epr-file slice.epr
globusrun-ws: Error: invalid or unknown job resource while querying
job state. Job may have expired or been destroyed.
So, somehow GridWay is not notified of the job's status (or does not
successfully find it out). Any idea what is going on here?
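For anyone who wants to repeat the backend check without retyping the command, here is a minimal Python sketch. It assumes only the globusrun-ws invocation and the two kinds of output shown above. The function names and the "Gone"/"Unknown" labels are my own, and I am assuming the error text may arrive on stderr, so both streams are scanned.

```python
import re
import subprocess

# Matches lines such as "Current job state: Active".
STATE_RE = re.compile(r"Current job state:\s*(\S+)")


def parse_state(output):
    """Return the state named in globusrun-ws output.

    "Gone" and "Unknown" are labels invented for this sketch; they are
    not states reported by Globus itself.
    """
    match = STATE_RE.search(output)
    if match:
        return match.group(1)
    if "invalid or unknown job resource" in output:
        # Same error as in the transcript: the job resource has expired
        # or been destroyed on the Globus side.
        return "Gone"
    return "Unknown"


def query_state(epr_file):
    """Run the status command used above and classify its output."""
    result = subprocess.run(
        ["globusrun-ws", "-status", "-job-epr-file", epr_file],
        capture_output=True,
        text=True,
    )
    # Assumption: the error may be printed on stderr, so scan both streams.
    return parse_state(result.stdout + result.stderr)
```

Calling query_state("slice.epr") right after gwkill should return "Gone" in the situation described above, while gwps still lists the job.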
Thanks,
Prakash
On Oct 28, 2010, at 3:56 PM, Prakash Velayutham wrote:
Hello all,
I did send this to the gridway-users list earlier. Just wanted to
know if someone here has an idea of what might be wrong.
When a GridWay-submitted job is running and the submitting user
kills it through GridWay with
gwkill <jobid>
the command just hangs. Furthermore, gwps still shows the job as
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1> gwps
USER JID DM EM START END EXEC XFER EXIT
NAME HOST
velge9:0 0 canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 --
pi.jt bmiclusterapps.cchmc.org/LSF
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1>
In the LRMS itself (LSF in my case), the job has been terminated
correctly. I think that the job has terminated according to Globus.
What is causing this?
I also saw this in gwd.log. The following is when the job is running.
Wed Oct 27 23:15:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:15:16 2010 [EM][D]: MAD message received:"POLL 0
SUCCESS ACTIVE".
Wed Oct 27 23:15:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:26 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:15:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:15:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:41 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:15:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:15:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:56 2010 [DM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:15:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:15:56 2010 [TM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [EM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][I]: Discovering hosts.
Wed Oct 27 23:16:01 2010 [IM][D]: Discovering hosts with MAD mds4, 1
active queries.
Wed Oct 27 23:16:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][D]: MAD (mds4) message DISCOVER -
SUCCESS bmiclusterapps.cchmc.org (info length=25).
Wed Oct 27 23:16:01 2010 [IM][D]: Discovery action done, 0 active
queries.
Wed Oct 27 23:16:01 2010 [IM][I]: Hosts discovered by MAD (mds4):
bmiclusterapps.cchmc.org
Wed Oct 27 23:16:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:11 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:16:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:16:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][I]: Poll timeout of job 0 expired.
Checking execution state.
Wed Oct 27 23:16:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:16 2010 [EM][D]: MAD message received:"POLL 0
SUCCESS ACTIVE".
The following is from when I try to kill the job using "gwkill 0":
Wed Oct 27 23:16:23 2010 [RM][I]: Authorizing user velge9, with
proxy path "".
Wed Oct 27 23:16:23 2010 [DM][I]: Killing job 0.
Wed Oct 27 23:16:23 2010 [EM][I]: Cancelling job 0.
Wed Oct 27 23:16:23 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:23 2010 [EM][D]: MAD message received:"CANCEL 0
SUCCESS -".
Wed Oct 27 23:16:26 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:16:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:16:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:16:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:16:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [UM][I]: -- MARK --
Wed Oct 27 23:16:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:16:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:17:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0.
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0
("bmiclusterapps.cchmc.org"), 1 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: MAD (mds4) message MONITOR 0
SUCCESS HOSTNAME="bmiclusterapps.cchmc.org" ARCH="NULL"
OS_NAME="NULL" OS_VERSION="NULL" CPU_MODEL="NULL" CPU_MHZ=0
CPU_FREE=0 CPU_SMP=0 NODECOUNT=264 SIZE_MEM_MB=0 FREE_MEM_MB=0
SIZE_DISK_MB=0 FREE_DISK_MB=0 FORK_NAME="Fork" LRMS_NAME="LSF"
LRMS_TYPE="lsf"
QUEUE_NAME[0]="pdxpop" QUEUE_NODECOUNT[0]=264
QUEUE_FREENODECOUNT[0]=264 QUEUE_MAXTIME[0]=0
QUEUE_MAXCPUTIME[0]=-1 QUEUE_MAXCOUNT[0]=0 QUEUE_MAXRUNNINGJOBS[0]=0
QUEUE_MAXJOBSINQUEUE[0]=0 QUEUE_STATUS[0]="enabled"
QUEUE_DISPATCHTYPE[0]="NULL" QUEUE_PRIORITY[0]="0"
QUEUE_NAME[1]="priority" QUEUE_NODECOUNT[1]=264
QUEUE_FREENODECOUNT[1]=264 QUEUE_MAXTIME[1]=0
QUEUE_MAXCPUTIME[1]=-1 QUEUE_MAXCOUNT[1]=0 QUEUE_MAXRUNNINGJOBS[1]=0
QUEUE_MAXJOBSINQUEUE[1]=0 QUEUE_STATUS[1]="enabled"
QUEUE_DISPATCHTYPE[1]="NULL" QUEUE_PRIORITY[1]="0"
QUEUE_NAME[2]="night" QUEUE_NODECOUNT[2]=264
QUEUE_FREENODECOUNT[2]=264 QUEUE_MAXTIME[2]=0
QUEUE_MAXCPUTIME[2]=-1 QUEUE_MAXCOUNT[2]=0 QUEUE_MAXRUNNINGJOBS[2]=0
QUEUE_MAXJOBSINQUEUE[2]=0 QUEUE_STATUS[2]="enabled"
QUEUE_DISPATCHTYPE[2]="NULL" QUEUE_PRIORITY[2]="0" (info length=2384).
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring action done, 0 active
queries.
Wed Oct 27 23:17:06 2010 [IM][D]: Host 0 successfully monitored.
Wed Oct 27 23:17:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:11 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Wed Oct 27 23:17:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:17:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END
- SUCCESS - (info length=1).
Wed Oct 27 23:17:16 2010 [EM][I]: Poll timeout of job 0 expired.
Checking execution state.
Wed Oct 27 23:17:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:26 2010 [DM][D]: Checking rescheduling conditions
of jobs.
Even though it says "Poll timeout of job 0 expired. Checking
execution state.", it never logs the "POLL 0 SUCCESS ACTIVE" MAD
message like earlier. Maybe that is what is causing this issue?
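To spot this pattern in a longer gwd.log, a small Python sketch like the one below could flag every poll timeout that is never followed by a POLL reply from the MAD. It relies only on the two message formats visible in the excerpts above. The function name is hypothetical, and treating a timeout as unanswered once the next timeout for the same job appears is my own simplification.

```python
import re

# The two gwd.log message formats quoted in the excerpts above.
TIMEOUT_RE = re.compile(r"\[EM\]\[I\]: Poll timeout of job (\d+) expired")
REPLY_RE = re.compile(r'\[EM\]\[D\]: MAD message received:"POLL (\d+)')


def unanswered_polls(lines):
    """Return the poll-timeout lines that no POLL reply followed."""
    pending = {}     # job id -> timeout line still waiting for a reply
    unanswered = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            job = timeout.group(1)
            if job in pending:
                # A new timeout arrived before the previous one was answered.
                unanswered.append(pending[job])
            pending[job] = line.rstrip("\n")
            continue
        reply = REPLY_RE.search(line)
        if reply:
            # The MAD answered the outstanding poll for this job.
            pending.pop(reply.group(1), None)
    # Timeouts still waiting at the end of the log never got a reply.
    unanswered.extend(pending.values())
    return unanswered
```

Passing an open gwd.log file object to unanswered_polls() is enough, since it only iterates over lines. Note that the very last timeout in a live log may simply not have been answered yet, so that final entry deserves a second look before drawing conclusions.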
Thanks,
Prakash