Re: [gt-user] Gridway jobs not getting cancelled correctly when using gwkill

Prakash Velayutham Thu, 04 Nov 2010 13:09:16 -0700

Hello all,

I did a little more debugging on this myself. Here is what I did.


1. Submitted a job from GW using gwsubmit.

2. From inside of Globus' logs, I could get the Globus job ID for this job.

3. I created a relevant EPR file for this and used globusrun-ws -status -job-epr-file <> to check the status of the job.
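The manual status check in step 3 can also be scripted. The following is only a minimal Python sketch, not something that was run in this debugging session. It invokes the same globusrun-ws -status -job-epr-file command in a loop. The function names (parse_state, poll_job), the polling interval, and the assumption that the tool prints a line of the form "Current job state: <state>" (as in the transcript in this message) are illustrative.

```python
import re
import subprocess
import time

# Matches the status line seen in the transcript, e.g. "Current job state: Active".
STATE_RE = re.compile(r"^Current job state:\s*(\S+)\s*$", re.MULTILINE)


def parse_state(output):
    """Return the job state named in globusrun-ws output, or None if absent."""
    match = STATE_RE.search(output)
    return match.group(1) if match else None


def poll_job(epr_file, interval=5.0, max_polls=10):
    """Poll the job status until no state line can be read any more.

    Hypothetical helper: it runs the same command that was typed by hand
    in this message and returns the states observed, in order.
    """
    states = []
    for _ in range(max_polls):
        result = subprocess.run(
            ["globusrun-ws", "-status", "-job-epr-file", epr_file],
            capture_output=True,
            text=True,
        )
        state = parse_state(result.stdout + result.stderr)
        if state is None:
            # No status line, e.g. an "invalid or unknown job resource" error.
            break
        states.append(state)
        time.sleep(interval)
    return states
```

Calling poll_job("slice.epr") would need globusrun-ws on the PATH and a valid proxy; parse_state can be tried on its own against captured output.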


host:~> gwsubmit pi/pi.jt
host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST

host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST

host:~> gwsubmit pi/pi.jt
host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --

host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --

host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 pend ---- 15:26:54 --:--:-- 0:00:00 0:00:00 -- pi.jt --

host:~> gwps

USER JID DM EM START END EXEC XFER EXIT NAME HOST
loniuser:0 0 wrap ---- 15:26:54 --:--:-- 0:00:02 0:00:01 -- pi.jt host/LSF

host:~> vi slice.epr
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Unsubmitted
host:~> globusrun-ws -status -job-epr-file slice.epr
Current job state: Active
host:~> gwkill 0

At this point, this command basically hangs. I had to background it and run globusrun-ws -status to get the real status in the backend, and it was:


host:~> globusrun-ws -status -job-epr-file slice.epr

globusrun-ws: Error: invalid or unknown job resource while querying job state. Job may have expired or been destroyed.

So, somehow GridWay is not notified of (or does not successfully find out) the status of the job. Any idea what is going on here?


Thanks,
Prakash

On Oct 28, 2010, at 3:56 PM, Prakash Velayutham wrote:

Hello all,
I did send this to the gridway-users list earlier. Just wanted to know if someone here has an idea of what might be wrong.
When a GridWay-submitted job is running, if the submitting user kills it using GridWay like
gwkill <jobid>

the command just hangs. Furthermore, gwps still shows the job as

gwad...@bmiclusterapps:/usr/local/gridway/5.6.1> gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
velge9:0 0 canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 -- pi.jt bmiclusterapps.cchmc.org/LSF
gwad...@bmiclusterapps:/usr/local/gridway/5.6.1>
In the LRMS itself (LSF in my case), the job has been terminated correctly. I think that the job has terminated according to Globus.
What is causing this?
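
The mismatch described here (GridWay still listing the job although the LRMS has already terminated it) shows up in the DM and EM columns of gwps. As a rough sketch, assuming the rows are whitespace-separated in the column order of the gwps header, and assuming that "canl" in DM together with "actv" in EM is the stuck combination seen above, a few lines of Python can flag such rows. The function names are hypothetical; this is not a GridWay API.

```python
# Column names as printed in the gwps header in this message.
COLUMNS = ["USER", "JID", "DM", "EM", "START", "END",
           "EXEC", "XFER", "EXIT", "NAME", "HOST"]


def parse_gwps_row(line):
    """Split one gwps job row into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def is_stuck_cancelling(row):
    """True if DM shows the job as cancelling while EM still reports it active."""
    return row["DM"] == "canl" and row["EM"] == "actv"


# The row reported in this message, with its columns separated by spaces.
row = parse_gwps_row(
    "velge9:0 0 canl actv 22:29:42 --:--:-- 0:10:39 0:00:01 "
    "-- pi.jt bmiclusterapps.cchmc.org/LSF"
)
print(row["JID"], row["DM"], row["EM"], is_stuck_cancelling(row))
```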

I also saw this in gwd.log. The following is when the job is running.

Wed Oct 27 23:15:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:15:16 2010 [EM][D]: MAD message received: "POLL 0 SUCCESS ACTIVE".
Wed Oct 27 23:15:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:15:56 2010 [DM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:15:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:15:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:15:56 2010 [TM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [EM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][I]: -- MARK --
Wed Oct 27 23:15:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][I]: Discovering hosts.
Wed Oct 27 23:16:01 2010 [IM][D]: Discovering hosts with MAD mds4, 1 active queries.
Wed Oct 27 23:16:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:01 2010 [IM][D]: MAD (mds4) message DISCOVER - SUCCESS bmiclusterapps.cchmc.org (info length=25).
Wed Oct 27 23:16:01 2010 [IM][D]: Discovery action done, 0 active queries.
Wed Oct 27 23:16:01 2010 [IM][I]: Hosts discovered by MAD (mds4): bmiclusterapps.cchmc.org
Wed Oct 27 23:16:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:16:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:16 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:16 2010 [EM][D]: MAD message received: "POLL 0 SUCCESS ACTIVE".

The following is when I try to kill the job using "gwkill 0":

Wed Oct 27 23:16:23 2010 [RM][I]: Authorizing user velge9, with proxy path "".
Wed Oct 27 23:16:23 2010 [DM][I]: Killing job 0.
Wed Oct 27 23:16:23 2010 [EM][I]: Cancelling job 0.
Wed Oct 27 23:16:23 2010 [EM][D]: Reading from MAD pipe 1.
Wed Oct 27 23:16:23 2010 [EM][D]: MAD message received: "CANCEL 0 SUCCESS -".
Wed Oct 27 23:16:26 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:26 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:26 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:26 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:31 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:36 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:41 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:41 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:41 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:16:46 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:51 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [UM][I]: -- MARK --
Wed Oct 27 23:16:56 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:16:56 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:16:56 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:16:56 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:01 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:06 2010 [IM][D]:       Monitoring host 0.
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring host 0 ("bmiclusterapps.cchmc.org"), 1 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: MAD (mds4) message MONITOR 0 SUCCESS HOSTNAME="bmiclusterapps.cchmc.org" ARCH="NULL" OS_NAME="NULL" OS_VERSION="NULL" CPU_MODEL="NULL" CPU_MHZ=0 CPU_FREE=0 CPU_SMP=0 NODECOUNT=264 SIZE_MEM_MB=0 FREE_MEM_MB=0 SIZE_DISK_MB=0 FREE_DISK_MB=0 FORK_NAME="Fork" LRMS_NAME="LSF" LRMS_TYPE="lsf" QUEUE_NAME[0]="pdxpop" QUEUE_NODECOUNT[0]=264 QUEUE_FREENODECOUNT[0]=264 QUEUE_MAXTIME[0]=0 QUEUE_MAXCPUTIME[0]=-1 QUEUE_MAXCOUNT[0]=0 QUEUE_MAXRUNNINGJOBS[0]=0 QUEUE_MAXJOBSINQUEUE[0]=0 QUEUE_STATUS[0]="enabled" QUEUE_DISPATCHTYPE[0]="NULL" QUEUE_PRIORITY[0]="0" QUEUE_NAME[1]="priority" QUEUE_NODECOUNT[1]=264 QUEUE_FREENODECOUNT[1]=264 QUEUE_MAXTIME[1]=0 QUEUE_MAXCPUTIME[1]=-1 QUEUE_MAXCOUNT[1]=0 QUEUE_MAXRUNNINGJOBS[1]=0 QUEUE_MAXJOBSINQUEUE[1]=0 QUEUE_STATUS[1]="enabled" QUEUE_DISPATCHTYPE[1]="NULL" QUEUE_PRIORITY[1]="0" QUEUE_NAME[2]="night" QUEUE_NODECOUNT[2]=264 QUEUE_FREENODECOUNT[2]=264 QUEUE_MAXTIME[2]=0 QUEUE_MAXCPUTIME[2]=-1 QUEUE_MAXCOUNT[2]=0 QUEUE_MAXRUNNINGJOBS[2]=0 QUEUE_MAXJOBSINQUEUE[2]=0 QUEUE_STATUS[2]="enabled" QUEUE_DISPATCHTYPE[2]="NULL" QUEUE_PRIORITY[2]="0" (info length=2384).
Wed Oct 27 23:17:06 2010 [IM][D]: Monitoring action done, 0 active queries.
Wed Oct 27 23:17:06 2010 [IM][D]: Host 0 successfully monitored.
Wed Oct 27 23:17:11 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:11 2010 [DM][D]: Checking rescheduling conditions of jobs.
Wed Oct 27 23:17:11 2010 [DM][I]: Scheduling 1 jobs (0 arrays).
Wed Oct 27 23:17:11 2010 [DM][D]: MAD (builtin) message SCHEDULE_END - SUCCESS - (info length=1).
Wed Oct 27 23:17:16 2010 [EM][I]: Poll timeout of job 0 expired. Checking execution state.
Wed Oct 27 23:17:16 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:21 2010 [IM][D]: Checking hosts starting with 0...
Wed Oct 27 23:17:26 2010 [DM][D]: Checking rescheduling conditions of jobs.

Even though it says "Poll timeout of job 0 expired. Checking execution state.", the log never shows a "POLL 0 SUCCESS ACTIVE" reply from the MAD as it did earlier. Maybe that is what is causing this issue?
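
To check this observation over a longer gwd.log, something like the following Python sketch could help. It assumes the two message formats shown in the excerpts above (the "Poll timeout of job N expired" line from the EM and the "MAD message received" line carrying a POLL reply). It reports every poll timeout that is not followed by a POLL reply for the same job before the next timeout or the end of the log. The function name unanswered_polls is made up for illustration.

```python
import re

# Patterns taken from the log excerpts in this message.
TIMEOUT_RE = re.compile(r"\[EM\]\[I\]: Poll timeout of job (\d+) expired")
REPLY_RE = re.compile(r'\[EM\]\[D\]: MAD message received:\s*"POLL (\d+)')


def unanswered_polls(lines):
    """Return the timeout lines that were never followed by a POLL reply."""
    pending = {}      # job id -> timeout line still waiting for a reply
    unanswered = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            job = timeout.group(1)
            if job in pending:
                # A new timeout fired before the previous one was answered.
                unanswered.append(pending[job])
            pending[job] = line.strip()
            continue
        reply = REPLY_RE.search(line)
        if reply:
            # A POLL reply arrived for this job, so the timeout was answered.
            pending.pop(reply.group(1), None)
    # Whatever is still pending at the end of the log got no reply.
    unanswered.extend(pending.values())
    return unanswered
```

Passing an open file handle for gwd.log (its location depends on the installation) would list the timeouts with no reply; on the excerpts above, only the 23:17:16 timeout would be reported.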
Thanks,
Prakash
