Hello all,

I did send this to the gridway-users list earlier. Just wanted to know if someone here has an idea of what might be wrong.

When a job submitted through GridWay is running and the submitting user tries to kill it with

gwkill <jobid>

the command just hangs. Afterwards, gwps still shows the job:

gwad...@bmiclusterapps:/usr/local/gridway/5.6.1> gwps
USER      JID DM   EM   START    END      EXEC    XFER    EXIT NAME   HOST
velge9:0  0   canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 --   pi.jt  bmiclusterapps.cchmc.org/LSF
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1>

In the LRMS itself (LSF in my case), the job has been terminated correctly. I think that the job has terminated according to Globus.

What is causing this?

I also saw this in gwd.log. The following is when the job is running.

Wed Oct 27 23:15:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:15:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS ACTIVE".
Wed Oct 27 23:15:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:56 2010 [DM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:56 2010 [TM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [EM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][I]: Discovering hosts.
Wed Oct 27 23:16:01 2010 [IM][D]: Discovering hosts with MAD mds4, 1 active queries.
Wed Oct 27 23:16:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][D]: MAD (mds4) message DISCOVER - SUCCESS bmiclusterapps.cchmc.org (info length=25).
Wed Oct 27 23:16:01 2010 [IM][D]: Discovery action done, 0 active queries.
Wed Oct 27 23:16:01 2010 [IM][I]: Hosts discovered by MAD (mds4): bmiclusterapps.cchmc.org
Wed Oct 27 23:16:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:16:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS ACTIVE".

The following is when I try to kill the job using "gwkill 0"

Wed Oct 27 23:16:23 2010 [RM][I]: Authorizing user velge9, with proxy path "".
Wed Oct 27 23:16:23 2010 [DM][I]: Killing job 0.
Wed Oct 27 23:16:23 2010 [EM][I]: Cancelling job 0.
Wed Oct 27 23:16:23 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:23 2010 [EM][D]: MAD message received:"CANCEL 0 SUCCESS -".
Wed Oct 27 23:16:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [UM][I]: -- MARK --
Wed Oct 27 23:16:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]:       Monitoring host 0.
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0 ("bmiclusterapps.cchmc.org"), 1 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: MAD (mds4) message MONITOR 0 SUCCESS HOSTNAME="bmiclusterapps.cchmc.org" ARCH="NULL" OS_NAME="NULL" OS_VERSION="NULL" CPU_MODEL="NULL" CPU_MHZ=0 CPU_FREE=0 CPU_SMP=0 NODECOUNT=264 SIZE_MEM_MB=0 FREE_MEM_MB=0 SIZE_DISK_MB=0 FREE_DISK_MB=0 FORK_NAME="Fork" LRMS_NAME="LSF" LRMS_TYPE="lsf" QUEUE_NAME[0]="pdxpop" QUEUE_NODECOUNT[0]=264 QUEUE_FREENODECOUNT[0]=264 QUEUE_MAXTIME[0]=0 QUEUE_MAXCPUTIME[0]=-1 QUEUE_MAXCOUNT[0]=0 QUEUE_MAXRUNNINGJOBS[0]=0 QUEUE_MAXJOBSINQUEUE[0]=0 QUEUE_STATUS[0]="enabled" QUEUE_DISPATCHTYPE[0]="NULL" QUEUE_PRIORITY[0]="0" QUEUE_NAME[1]="priority" QUEUE_NODECOUNT[1]=264 QUEUE_FREENODECOUNT[1]=264 QUEUE_MAXTIME[1]=0 QUEUE_MAXCPUTIME[1]=-1 QUEUE_MAXCOUNT[1]=0 QUEUE_MAXRUNNINGJOBS[1]=0 QUEUE_MAXJOBSINQUEUE[1]=0 QUEUE_STATUS[1]="enabled" QUEUE_DISPATCHTYPE[1]="NULL" QUEUE_PRIORITY[1]="0" QUEUE_NAME[2]="night" QUEUE_NODECOUNT[2]=264 QUEUE_FREENODECOUNT[2]=264 QUEUE_MAXTIME[2]=0 QUEUE_MAXCPUTIME[2]=-1 QUEUE_MAXCOUNT[2]=0 QUEUE_MAXRUNNINGJOBS[2]=0 QUEUE_MAXJOBSINQUEUE[2]=0 QUEUE_STATUS[2]="enabled" QUEUE_DISPATCHTYPE[2]="NULL" QUEUE_PRIORITY[2]="0" (info length=2384).
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring action done, 0 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: Host 0 successfully monitored.
Wed Oct 27 23:17:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:17:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:17:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:17:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:26 2010 [DM][D]: Checking rescheduling conditions of jobs.

Even though the log says "Poll timeout of job 0 expired. Checking execution state.", no "POLL 0 SUCCESS ACTIVE" message follows it, as it did before the kill. Maybe that is what is causing this issue?
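
To check a longer gwd.log for the same pattern, something like the following could help. It is only a rough sketch: the two regular expressions are copied from the log lines quoted above, the function name unanswered_polls is made up for illustration, and it assumes that every answered poll shows up as a MAD message beginning with POLL and the job id.

```python
import re

# Patterns taken from the gwd.log excerpts in this message (assumed to be stable).
TIMEOUT_RE = re.compile(r"\[EM\]\[I\]: Poll timeout of job (\d+) expired")
POLL_RE = re.compile(r'\[EM\]\[D\]: MAD message received:"POLL (\d+) ')


def unanswered_polls(lines):
    """Return poll-timeout log lines that never got a POLL reply for the same job.

    A timeout counts as unanswered if the next timeout for that job (or the
    end of the input) arrives before any POLL message for that job.
    """
    pending = {}      # job id -> timeout line still waiting for a POLL reply
    unanswered = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            job = timeout.group(1)
            if job in pending:
                # The previous timeout for this job was never answered.
                unanswered.append(pending[job])
            pending[job] = line.strip()
            continue
        poll = POLL_RE.search(line)
        if poll:
            # A reply arrived, so the outstanding timeout is accounted for.
            pending.pop(poll.group(1), None)
    # Whatever is still pending at the end of the log was never answered.
    unanswered.extend(pending.values())
    return unanswered
```

Passing it the lines of gwd.log (for example an open file object) returns the timeouts that had no POLL reply. On the excerpts above, the 23:16:16 timeout is answered and the 23:17:16 one, after the CANCEL, is not.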

Thanks,
Prakash
