Hello all,
I sent this to the gridway-users list earlier, and wanted to ask
whether anyone here has an idea of what might be wrong.
When a GridWay-submitted job is running and the submitting user kills
it through GridWay with
gwkill <jobid>
the command just hangs. Afterwards, gwps still shows the job with DM
state "canl" and EM state "actv":
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1> gwps
USER      JID DM   EM   START    END      EXEC    XFER    EXIT NAME  HOST
velge9:0  0   canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 --   pi.jt bmiclusterapps.cchmc.org/LSF
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1>
In the LRMS itself (LSF in my case), the job has been terminated
correctly. I think Globus also considers the job terminated.
What is causing gwkill to hang and the job to stay in this state?
I also looked at gwd.log. The following was logged while the job was
still running:
Wed Oct 27 23:15:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:15:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS
ACTIVE".
Wed Oct 27 23:15:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:26 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:15:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:15:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:41 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:15:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:15:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:56 2010 [DM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:15:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:15:56 2010 [TM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [EM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][I]: Discovering hosts.
Wed Oct 27 23:16:01 2010 [IM][D]: Discovering hosts with MAD mds4, 1
active queries.
Wed Oct 27 23:16:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][D]: MAD (mds4) message DISCOVER -
SUCCESS bmiclusterapps.cchmc.org (info length=25).
Wed Oct 27 23:16:01 2010 [IM][D]: Discovery action done, 0 active
queries.
Wed Oct 27 23:16:01 2010 [IM][I]: Hosts discovered by MAD (mds4):
bmiclusterapps.cchmc.org
Wed Oct 27 23:16:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:11 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:16:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:16:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][I]: Poll timeout of job 0 expired.
Checking execution state.
Wed Oct 27 23:16:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:16 2010 [EM][D]: MAD message received:"POLL 0 SUCCESS
ACTIVE".
The following is logged when I try to kill the job using "gwkill 0":
Wed Oct 27 23:16:23 2010 [RM][I]: Authorizing user velge9, with proxy
path "".
Wed Oct 27 23:16:23 2010 [DM][I]: Killing job 0.
Wed Oct 27 23:16:23 2010 [EM][I]: Cancelling job 0.
Wed Oct 27 23:16:23 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:23 2010 [EM][D]: MAD message received:"CANCEL 0
SUCCESS -".
Wed Oct 27 23:16:26 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:16:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:16:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:16:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:16:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [UM][I]: -- MARK --
Wed Oct 27 23:16:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:16:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:17:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0.
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0
("bmiclusterapps.cchmc.org"), 1 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: MAD (mds4) message MONITOR 0 SUCCESS
HOSTNAME="bmiclusterapps.cchmc.org" ARCH="NULL" OS_NAME="NULL"
OS_VERSION="NULL" CPU_MODEL="NULL" CPU_MHZ=0 CPU_FREE=0 CPU_SMP=0
NODECOUNT=264 SIZE_MEM_MB=0 FREE_MEM_MB=0 SIZE_DISK_MB=0
FREE_DISK_MB=0 FORK_NAME="Fork" LRMS_NAME="LSF" LRMS_TYPE="lsf"
QUEUE_NAME[0]="pdxpop" QUEUE_NODECOUNT[0]=264
QUEUE_FREENODECOUNT[0]=264 QUEUE_MAXTIME[0]=0 QUEUE_MAXCPUTIME[0]=-1
QUEUE_MAXCOUNT[0]=0 QUEUE_MAXRUNNINGJOBS[0]=0
QUEUE_MAXJOBSINQUEUE[0]=0 QUEUE_STATUS[0]="enabled"
QUEUE_DISPATCHTYPE[0]="NULL" QUEUE_PRIORITY[0]="0"
QUEUE_NAME[1]="priority" QUEUE_NODECOUNT[1]=264
QUEUE_FREENODECOUNT[1]=264 QUEUE_MAXTIME[1]=0 QUEUE_MAXCPUTIME[1]=-1
QUEUE_MAXCOUNT[1]=0 QUEUE_MAXRUNNINGJOBS[1]=0
QUEUE_MAXJOBSINQUEUE[1]=0 QUEUE_STATUS[1]="enabled"
QUEUE_DISPATCHTYPE[1]="NULL" QUEUE_PRIORITY[1]="0"
QUEUE_NAME[2]="night" QUEUE_NODECOUNT[2]=264
QUEUE_FREENODECOUNT[2]=264 QUEUE_MAXTIME[2]=0 QUEUE_MAXCPUTIME[2]=-1
QUEUE_MAXCOUNT[2]=0 QUEUE_MAXRUNNINGJOBS[2]=0
QUEUE_MAXJOBSINQUEUE[2]=0 QUEUE_STATUS[2]="enabled"
QUEUE_DISPATCHTYPE[2]="NULL" QUEUE_PRIORITY[2]="0" (info length=2384).
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring action done, 0 active
queries.
Wed Oct 27 23:17:06 2010 [IM][D]: Host 0 successfully monitored.
Wed Oct 27 23:17:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:11 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Wed Oct 27 23:17:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:17:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END -
SUCCESS - (info length=1).
Wed Oct 27 23:17:16 2010 [EM][I]: Poll timeout of job 0 expired.
Checking execution state.
Wed Oct 27 23:17:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:26 2010 [DM][D]: Checking rescheduling conditions of
jobs.
Even though the log says "Poll timeout of job 0 expired. Checking
execution state." at 23:17:16, no MAD reply such as "POLL 0 SUCCESS
ACTIVE" follows it, unlike the poll at 23:16:16 before the kill.
Maybe that is what is causing this issue?
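
To check the whole log for this pattern instead of reading it by eye,
something like the following Python sketch could help. It is not part
of GridWay; the function names are my own, and the regular expressions
only assume the entry format seen in the excerpts above. It re-joins
wrapped lines, then reports every "Poll timeout of job N expired" entry
that is not followed by a "POLL N ..." MAD message before the next poll
for the same job (or the end of the log).

```python
import re

# Start of a gwd.log entry as seen in the excerpts above, e.g.
# "Wed Oct 27 23:16:16 2010 [EM][I]: ..." (assumed format).
ENTRY_START = re.compile(r"^\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4} \[")
POLL_EXPIRED = re.compile(r"\[EM\]\[I\]: Poll timeout of job (\d+) expired")
POLL_REPLY = re.compile(r'MAD message received:"POLL (\d+) ')


def join_wrapped(lines):
    """Merge wrapped continuation lines back into single log entries."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def unanswered_polls(lines):
    """Return poll announcements that never got a POLL reply from the MAD."""
    pending = {}  # job id -> entry that announced the poll
    missing = []
    for entry in join_wrapped(lines):
        expired = POLL_EXPIRED.search(entry)
        reply = POLL_REPLY.search(entry)
        if expired:
            job = expired.group(1)
            if job in pending:
                # A new poll started before the previous one was answered.
                missing.append(pending[job])
            pending[job] = entry
        elif reply:
            pending.pop(reply.group(1), None)
    # Polls still unanswered when the log ends.
    missing.extend(pending.values())
    return missing
```

Usage would be along the lines of
unanswered_polls(open("gwd.log")), with the path to gwd.log adjusted
for the installation. A poll announced in the very last lines of the
log may simply not have been answered yet, so that case needs a second
look before drawing conclusions.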
Thanks,
Prakash