Hello all,

I did a little more debugging on this myself. Here is what I did.

1. Submitted a job from GW using gwsubmit.
2. From the Globus logs, I got the Globus job ID for this job.
3. I created the corresponding EPR file for it and used globusrun-ws -status -job-epr-file <> to check the status of the job.

host:~> gwsubmit pi/pi.jt
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
host:~> gwsubmit pi/pi.jt
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --
host:~> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 wrap ---- 15:26:54 --:--:-- 0:00:02 0:00:01 -- pi.jt host/LSF
host:~> vi slice.epr
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Active
host:~> gwkill 0

At this point, the gwkill command basically hangs. I had to background it and run globusrun-ws -status again to get the real status on the backend, which was:

host:~> globusrun-ws -status -job-epr-file slice.epr
globusrun-ws: Error: invalid or unknown job resource while querying job state. Job may have expired or been destroyed.

So somehow GridWay is not notified of (or does not successfully find out) the status of the job. Any idea what is going on here?
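
For anyone who wants to repeat this check without retyping the command, here is a minimal Python sketch. It only uses the globusrun-ws flags shown in the transcript above. The function names are made up for illustration, and it assumes the error text may arrive on either stdout or stderr, which I have not verified.

```python
import re
import subprocess

# Matches the "Current job state: <State>" line seen in the transcript above.
STATE_RE = re.compile(r"^Current job state:\s*(\S+)", re.MULTILINE)


def parse_status(output):
    """Return the reported job state, or None if no state line is present
    (for example when globusrun-ws prints an error instead)."""
    match = STATE_RE.search(output)
    return match.group(1) if match else None


def query_status(epr_path):
    """Hypothetical wrapper: run globusrun-ws once and parse its output.
    Assumes globusrun-ws is on PATH; stdout and stderr are combined
    because it is not certain which stream carries the error message."""
    result = subprocess.run(
        ["globusrun-ws", "-status", "-job-epr-file", epr_path],
        capture_output=True,
        text=True,
    )
    return parse_status(result.stdout + result.stderr)
```

A return value of None after gwkill would correspond to the "invalid or unknown job resource" case above, meaning the job resource is already gone on the Globus side.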

Thanks,
Prakash

On Oct 28, 2010, at 3:56 PM, Prakash Velayutham wrote:

Hello all,

I did send this to the gridway-users list earlier. Just wanted to know if someone here has an idea of what might be wrong.

When a GridWay-submitted job is running and the submitting user kills it through GridWay with

gwkill <jobid>

the command just hangs. Furthermore, gwps still shows the job as

gwad...@bmiclusterapps:/usr/local/gridway/5.6.1> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
velge9:0 0 canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 -- pi.jt bmiclusterapps.cchmc.org/LSF
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1>

In the LRMS itself (LSF in my case), the job has been terminated correctly. I think that the job has terminated according to Globus.

What is causing this?

I also saw this in gwd.log. The following is when the job is running.

Wed Oct 27 23:15:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:15:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS ACTIVE".
Wed Oct 27 23:15:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:56 2010 [DM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:56 2010 [TM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [EM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][I]: Discovering hosts.
Wed Oct 27 23:16:01 2010 [IM][D]: Discovering hosts with MAD mds4, 1 active queries.
Wed Oct 27 23:16:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][D]: MAD (mds4) message DISCOVER - SUCCESS bmiclusterapps.cchmc.org (info length=25).
Wed Oct 27 23:16:01 2010 [IM][D]: Discovery action done, 0 active queries.
Wed Oct 27 23:16:01 2010 [IM][I]: Hosts discovered by MAD (mds4): bmiclusterapps.cchmc.org
Wed Oct 27 23:16:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:16:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS ACTIVE".

The following is when I try to kill the job using "gwkill 0"

Wed Oct 27 23:16:23 2010 [RM][I]: Authorizing user velge9, with proxy path "".
Wed Oct 27 23:16:23 2010 [DM][I]: Killing job 0.
Wed Oct 27 23:16:23 2010 [EM][I]: Cancelling job 0.
Wed Oct 27 23:16:23 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:23 2010 [EM][D]: MAD message received:"CANCEL 0 SUCCESS -".
Wed Oct 27 23:16:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [UM][I]: -- MARK --
Wed Oct 27 23:16:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]:       Monitoring host 0.
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0 ("bmiclusterapps.cchmc.org"), 1 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: MAD (mds4) message MONITOR 0 SUCCESS HOSTNAME="bmiclusterapps.cchmc.org" ARCH="NULL" OS_NAME="NULL" OS_VERSION="NULL" CPU_MODEL="NULL" CPU_MHZ=0 CPU_FREE=0 CPU_SMP=0 NODECOUNT=264 SIZE_MEM_MB=0 FREE_MEM_MB=0 SIZE_DISK_MB=0 FREE_DISK_MB=0 FORK_NAME="Fork" LRMS_NAME="LSF" LRMS_TYPE="lsf" QUEUE_NAME[0]="pdxpop" QUEUE_NODECOUNT[0]=264 QUEUE_FREENODECOUNT[0]=264 QUEUE_MAXTIME[0]=0 QUEUE_MAXCPUTIME[0]=-1 QUEUE_MAXCOUNT[0]=0 QUEUE_MAXRUNNINGJOBS[0]=0 QUEUE_MAXJOBSINQUEUE[0]=0 QUEUE_STATUS[0]="enabled" QUEUE_DISPATCHTYPE[0]="NULL" QUEUE_PRIORITY[0]="0" QUEUE_NAME[1]="priority" QUEUE_NODECOUNT[1]=264 QUEUE_FREENODECOUNT[1]=264 QUEUE_MAXTIME[1]=0 QUEUE_MAXCPUTIME[1]=-1 QUEUE_MAXCOUNT[1]=0 QUEUE_MAXRUNNINGJOBS[1]=0 QUEUE_MAXJOBSINQUEUE[1]=0 QUEUE_STATUS[1]="enabled" QUEUE_DISPATCHTYPE[1]="NULL" QUEUE_PRIORITY[1]="0" QUEUE_NAME[2]="night" QUEUE_NODECOUNT[2]=264 QUEUE_FREENODECOUNT[2]=264 QUEUE_MAXTIME[2]=0 QUEUE_MAXCPUTIME[2]=-1 QUEUE_MAXCOUNT[2]=0 QUEUE_MAXRUNNINGJOBS[2]=0 QUEUE_MAXJOBSINQUEUE[2]=0 QUEUE_STATUS[2]="enabled" QUEUE_DISPATCHTYPE[2]="NULL" QUEUE_PRIORITY[2]="0" (info length=2384).
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring action done, 0 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: Host 0 successfully monitored.
Wed Oct 27 23:17:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:17:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:17:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:17:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:26 2010 [DM][D]: Checking rescheduling conditions of jobs.

Even though it says "Poll timeout of job 0 expired. Checking execution state.", the log never shows a poll reply from the MAD (such as "POLL 0 SUCCESS ACTIVE") as it did earlier. Maybe that is what is causing this issue?
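
To check this across a longer gwd.log, a small Python sketch along these lines could flag poll timeouts that are never followed by a POLL message from the MAD. The two patterns are copied from the log excerpts above. The function name is made up for illustration, and it assumes that every answered poll shows up as a 'MAD message received:"POLL <jid> ..."' line.

```python
import re

# Patterns taken from the gwd.log excerpts above.
TIMEOUT_RE = re.compile(r"\[EM\]\[I\]: Poll timeout of job (\d+) expired")
POLL_RE = re.compile(r'MAD message received:"POLL (\d+) ')


def unanswered_polls(lines):
    """Map job ID -> text of the last poll-timeout line that was not
    followed by a POLL message for that job."""
    pending = {}
    for line in lines:
        # Collect both kinds of event in order of position, so that a
        # line holding several fused log entries is still read correctly.
        events = [(m.start(), "timeout", m.group(1)) for m in TIMEOUT_RE.finditer(line)]
        events += [(m.start(), "poll", m.group(1)) for m in POLL_RE.finditer(line)]
        for _, kind, job in sorted(events):
            if kind == "timeout":
                pending[job] = line.strip()
            else:
                pending.pop(job, None)
    return pending
```

Passing it the lines of gwd.log (for example via open(path).readlines()) and getting back an entry for job 0 would match the behaviour described above, where the timeout at 23:17:16 has no POLL reply after it.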

Thanks,
Prakash
