Excellent. Out of curiosity, what was your use case for having a
second condor-based jobmanager?
Charles
On Feb 17, 2009, at 10:35 AM, Andre Charbonneau wrote:
Thanks for the pointer, Charles. Indeed, the job IDs didn't match.
After digging more into the logs and code, I noticed that this was
caused by the following:
- Globus checks for the 'condorness' of a local resource manager in
ManagedExecutableJobResource.java, which checks whether the name of the
resource manager matches ManagedJobFactoryConstants.FACTORY_TYPE.CONDOR.
Since my new local resource manager is named 'foo', that check fails,
and because of that the 'emitCondorProcesses' attribute is never set.
- Since emitCondorProcesses is not set, the following piece of code in
foo.in is affected:
  if ($job_id ne '')
  {
      $status = Globus::GRAM::JobState::PENDING;
      if ($description->emit_condor_processes()) {
          $job_id = join(',',
                         map { sprintf("%03d.%03d.%03d", $job_id, $_, 0) }
                             (0..($description->count()-1)));
      }
      return {JOB_STATE => Globus::GRAM::JobState::PENDING,
              JOB_ID    => $job_id};
  }
Basically, the job ID will not be transformed into the appropriate
condor format, so the job state notifications will not work.
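For illustration, the rewrite that the skipped emit_condor_processes branch would have performed can be run standalone. This is only a sketch of the sprintf mapping above; the raw id 98 and the process count of 2 are assumed values chosen to show the shape of the output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standalone sketch of the id rewrite in foo.in: a raw scheduler job id
# is expanded into one "%03d.%03d.%03d" triple per process, which is
# the xxx.yyy.zzz form the SEG reports.
my $job_id = 98;   # assumed raw id, for illustration only
my $count  = 2;    # stands in for $description->count()

my $seg_id = join(',',
                  map { sprintf("%03d.%03d.%03d", $job_id, $_, 0) }
                      (0 .. $count - 1));

print "$seg_id\n";   # prints 098.000.000,098.001.000
```

With the condition skipped, the raw id 98 would have been reported as-is, which is why it never matched the SEG's zero-padded form.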
To fix this, I simply removed the condition in the code above and
re-installed my foo job manager; job notifications are working fine
now.
Thanks again for the feedback.
Regards,
Andre
Charles Bacon wrote:
Are you sure that the job IDs referenced in the SEG output (which look
like xxx.yyy.zzz) match the job IDs that WS-GRAM thinks it has gotten
back from the perl jobmanagers? I've done one of these second-condor
jobmanagers before for OSG's ManagedFork jobmanager, and there was some
problem where the scripts were reporting xxx.0, but the SEG was
reporting on xxx.000.000. WS-GRAM won't realize that those are supposed
to be the same, so you can either modify the behavior of your foo.pm or
your SEG so they match up. If it's not obvious at your current level of
logging, bump the GRAM logging up to DEBUG in the
container-log4j.properties file.
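To make the mismatch concrete, a normalization along these lines would map the scripts' xxx.0 ids onto the SEG's xxx.000.000 form. This is a hypothetical sketch, not code from GT; the function name and the assumed "cluster.proc" input shape are mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite a "cluster.proc" id such as "98.0" (as
# the perl scripts were reporting) into the zero-padded "098.000.000"
# form the SEG reports, so WS-GRAM sees the two as the same job.
sub normalize_job_id {
    my ($id) = @_;
    my ($cluster, $proc) = split /\./, $id;
    $proc = 0 unless defined $proc;
    return sprintf("%03d.%03d.%03d", $cluster, $proc, 0);
}

print normalize_job_id("98.0"), "\n";   # prints 098.000.000
```

The equivalent change could of course be made on the SEG side instead, as long as both ends agree on one format.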
Charles
On Feb 16, 2009, at 11:28 AM, Andre Charbonneau wrote:
Greetings,
I'm currently trying to deploy a new job manager and scheduler event
generator and I'm having some problems.
Basically, what I am trying to do is have a second condor job manager,
scheduler interface and SEG module, but with a different name (foo).
To get started, I simply cloned the code from the existing condor job
manager, scheduler provider and SEG module and changed the names in the
various files to refer to 'foo' instead of 'condor'.
So far, I'm able to submit my job and the job runs to completion. The
problem I'm having is that the globusrun-ws client does not seem to get
any notifications, even though my SEG module seems to be working fine.
It simply waits forever after I submit my job. For example:
globusrun-ws -submit -s -Ft Foo -Jf creds.epr -Sf creds.epr \
    -Tf creds.epr -F ******* -f myjob.xml
Submitting job...Done.
Job ID: uuid:f4df4ba2-fc4d-11dd-8f66-00b0d0e1435d
Termination time: 02/17/2009 17:19 GMT
Current job state: Unsubmitted
I checked if my SEG module is running and it looks OK:
ps -ef | grep globus-scheduler-event-generator
globus   26288 26229  0 11:53 ?      00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s foo -t 1234802181
globus   26300 26229  0 11:53 ?      00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s fork -t 1234801181
globus   26744 26229  0 12:10 ?      00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s condor -t 1234803725
globus   26748  2895  0 12:10 pts/0  00:00:00 grep globus-scheduler-event-generator
And when I run it by hand, it looks like it is behaving OK too:
001;1234804765;098.000.000;1;0
001;1234804806;098.000.000;2;0
001;1234804823;098.000.000;8;0
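Each of those event lines splits on ';' into five fields. My reading of them, taken from the sample output itself rather than from documentation, is: format version, timestamp, job id, job state, and exit code, where the state values appear to follow the Globus::GRAM::JobState constants (1 = PENDING, 2 = ACTIVE, 8 = DONE). A minimal parsing sketch under that assumption:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: split one SEG event line into its fields. The field meanings
# named below are my reading of the sample output, not taken from the
# SEG documentation.
my $line = "001;1234804765;098.000.000;1;0";
my ($version, $timestamp, $job_id, $state, $exit_code) = split /;/, $line;

print "job $job_id is in state $state\n";
```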
I've compared my code with the one from the condor job manager and I
can't find what I'm missing.
Has anyone else had similar issues when deploying their custom-made
job managers? Is anything other than the SEG module required for the
job state notifications to be properly sent to the client?
(I'm using GT 4.0.8)
Thanks,
Andre
--
Andre Charbonneau
Research Computing Support, IMSB
National Research Council Canada
100 Sussex Drive, Rm 2025
Ottawa, ON, Canada K1A 0R6
613 993-3129