Thanks for the pointer, Charles. Indeed, the job IDs didn't match.
After digging more into the logs and code, I noticed that this was
caused by the following:
- Globus checks for the 'condorness' of a local resource manager in
ManagedExecutableJobResource.java: it compares the name of the resource
manager against ManagedJobFactoryConstants.FACTORY_TYPE.CONDOR. Since my
new local resource manager is named 'foo', this check fails, and because
of that the 'emitCondorProcesses' attribute is never set.
- Since emitCondorProcesses is not set, the inner condition in the
following piece of code in foo.in is never true:
if ($job_id ne '')
{
    $status = Globus::GRAM::JobState::PENDING;

    # Only entered when emitCondorProcesses was set: rewrites the raw
    # job ID into one xxx.yyy.zzz entry per process, which is the
    # format the SEG reports.
    if ($description->emit_condor_processes()) {
        $job_id = join(',', map { sprintf("%03d.%03d.%03d",
                                          $job_id, $_, 0) }
                       (0..($description->count() - 1)));
    }

    return {JOB_STATE => Globus::GRAM::JobState::PENDING,
            JOB_ID    => $job_id};
}
Basically, the job ID is never transformed into the xxx.yyy.zzz Condor
format that the SEG reports, so WS-GRAM never matches the SEG events to
the job and the job state notifications do not work.
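To make the mismatch concrete, here is a small stand-alone snippet (not
from the actual module; the job ID of 98 is taken from my SEG output
quoted below, and the process count of 1 is just an assumed
single-process job) that applies the same sprintf as above:

#!/usr/bin/perl
use strict;
use warnings;

my $job_id = 98;   # raw job ID as reported by the scheduler script
my $count  = 1;    # assumed single-process job

# Same transformation as in foo.in/condor.in above.
print join(',', map { sprintf("%03d.%03d.%03d", $job_id, $_, 0) }
           (0 .. $count - 1)), "\n";

This prints 098.000.000, which is exactly the job ID field in the SEG
event lines quoted below, whereas without the transformation the script
reports a plain 98.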
To fix this, I simply removed the emit_condor_processes() condition in
the code above so that the job ID is always rewritten, re-installed my
foo job manager, and job notifications are working fine now.
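For reference, after the change the block in my foo.in looks roughly
like this (the same code as above, with the emit_condor_processes()
guard dropped so the rewrite always happens):

if ($job_id ne '')
{
    $status = Globus::GRAM::JobState::PENDING;

    # Always rewrite the job ID into the xxx.yyy.zzz form that the
    # SEG reports.
    $job_id = join(',', map { sprintf("%03d.%03d.%03d",
                                      $job_id, $_, 0) }
                   (0..($description->count() - 1)));

    return {JOB_STATE => Globus::GRAM::JobState::PENDING,
            JOB_ID    => $job_id};
}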
Thanks again for the feedback.
Regards,
Andre
Charles Bacon wrote:
> Are you sure that the Job IDs referenced in the SEG output (looks like
> xxx.yyy.zzz) match the Job IDs that WS-GRAM thinks it has gotten back
> from the perl jobmanagers? I've done one of these second-condor
> jobmanagers before for OSG's ManagedFork jobmanager, and there was some
> problem where the scripts were reporting xxx.0, but the SEG was
> reporting on xxx.000.000. WS-GRAM won't realize that those are supposed
> to be the same, so you can either modify the behavior of your foo.pm or
> your SEG so they match up. If it's not obvious at your current level of
> logging, bump the GRAM logging up to DEBUG in the
> container-log4j.properties file.
>
> Charles
>
> On Feb 16, 2009, at 11:28 AM, Andre Charbonneau wrote:
>
>> Greetings,
>> I'm currently trying to deploy a new job manager and scheduler event
>> generator and I'm having some problems.
>> Basically, what I am trying to do is to have a second Condor job manager,
>> scheduler interface, and SEG module, but with a different name (foo).
>> To get started, I simply cloned the code from the existing condor job
>> manager, scheduler provider and SEG module and changed the names in the
>> various files to refer to 'foo' instead of 'condor'.
>>
>> So far, I'm able to submit my job and the job runs to completion. The
>> problem I'm having is that the globusrun-ws client does not seem to get
>> any notifications, even though my SEG module seems to be working fine.
>> It simply waits forever after I submit my job. For example:
>>
>> globusrun-ws -submit -s -Ft Foo -Jf creds.epr -Sf creds.epr -Tf
>> creds.epr -F ******* -f myjob.xml
>> Submitting job...Done.
>> Job ID: uuid:f4df4ba2-fc4d-11dd-8f66-00b0d0e1435d
>> Termination time: 02/17/2009 17:19 GMT
>> Current job state: Unsubmitted
>>
>>
>>
>> I checked if my SEG module is running and it looks ok:
>>
>> ps -ef |grep globus-scheduler-event-generator
>> globus 26288 26229 0 11:53 ? 00:00:00
>> /usr/local/globus/libexec/globus-scheduler-event-generator -s foo -t
>> 1234802181
>> globus 26300 26229 0 11:53 ? 00:00:00
>> /usr/local/globus/libexec/globus-scheduler-event-generator -s fork -t
>> 1234801181
>> globus 26744 26229 0 12:10 ? 00:00:00
>> /usr/local/globus/libexec/globus-scheduler-event-generator -s condor -t
>> 1234803725
>> globus 26748 2895 0 12:10 pts/0 00:00:00 grep
>> globus-scheduler-event-generator
>>
>>
>>
>>
>> And when I run it by hand, it looks like it is behaving OK too:
>>
>> 001;1234804765;098.000.000;1;0
>> 001;1234804806;098.000.000;2;0
>> 001;1234804823;098.000.000;8;0
>>
>>
>> I've compared my code with the one from the Condor job manager and I
>> can't find what I'm missing.
>>
>> Has anyone else had similar issues when deploying their custom-made job
>> managers? Is anything other than the SEG module required for the job
>> state notifications to be properly sent to the client?
>>
>> (I'm using GT 4.0.8)
>>
>>
>> Thanks,
>> Andre
--
Andre Charbonneau
Research Computing Support, IMSB
National Research Council Canada
100 Sussex Drive, Rm 2025
Ottawa, ON, Canada K1A 0R6
613 993-3129