Hi Brendan,
You can view that code in the seg_sge_module.c file of the
globus_scheduler_event_generator_sge-1.1 package at:
http://www.lesc.ic.ac.uk/projects/SGE-GT4.html
There are variations on this code, but I don't know whether anyone has
touched the logic for decoding the reporting file information. I just
looked, and it checks the "acct" record in the reporting file for
errors. However, this doesn't seem very complete, as jobs can fail
and finish without any updated "acct" record being written.
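To give an idea of the format involved, here is a minimal sketch (in
Python, not the module's actual C code) of pulling the failure fields out
of an "acct" line. It assumes reporting lines are colon-delimited and
begin with "<time>:<record type>:", and that the rest of an "acct" record
follows the accounting(5) field order, with job_number, failed and
exit_status at 0-based positions 5, 11 and 12. Those positions and the
function names are my assumptions, so please check them against the man
pages for your SGE version.

```python
from typing import Dict, Optional

# Assumed 0-based positions inside an accounting-style record
# (taken from accounting(5); verify for your SGE version).
JOB_NUMBER_IDX = 5
FAILED_IDX = 11
EXIT_STATUS_IDX = 12


def parse_acct_record(line: str) -> Optional[Dict[str, int]]:
    """Return job_number/failed/exit_status for an 'acct' line, else None."""
    fields = line.rstrip("\n").split(":")
    # Assumed layout: <time>:<record type>:<record fields...>
    if len(fields) < 2 or fields[1] != "acct":
        return None
    record = fields[2:]
    if len(record) <= EXIT_STATUS_IDX:
        return None  # truncated or unexpected record
    try:
        return {
            "job_number": int(record[JOB_NUMBER_IDX]),
            "failed": int(record[FAILED_IDX]),
            "exit_status": int(record[EXIT_STATUS_IDX]),
        }
    except ValueError:
        return None  # non-numeric field where a number was expected


def job_failed(rec: Dict[str, int]) -> bool:
    """Treat a non-zero 'failed' code or a non-zero exit status as failure."""
    return rec["failed"] != 0 or rec["exit_status"] != 0
```

Lines of any other record type simply yield None, so the same function
can be run over every line of the reporting file.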
Perhaps others have implemented or considered a more robust way to check
for errors? You can get the exit status from the "accounting" file, but
that file isn't being parsed in this seg_sge_module.
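As a rough illustration of that alternative, the sketch below scans
accounting-file lines for one job and returns its (failed, exit_status)
pair. It assumes the colon-delimited layout from accounting(5), with
job_number, failed and exit_status at 0-based positions 5, 11 and 12,
and that header lines start with "#". The function name and those
positions are assumptions on my part, not something taken from the
Globus code.

```python
from typing import Iterable, Optional, Tuple


def find_exit_status(lines: Iterable[str],
                     job_number: int) -> Optional[Tuple[int, int]]:
    """Return (failed, exit_status) from the last record for job_number."""
    result = None
    for line in lines:
        # Skip blank lines and assumed '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) <= 12:
            continue  # too short to contain exit_status
        try:
            if int(fields[5]) != job_number:
                continue
            result = (int(fields[11]), int(fields[12]))
        except ValueError:
            continue  # malformed record, ignore it
    return result
```

It would typically be fed an open handle on
$SGE_ROOT/$SGE_CELL/common/accounting. Note that an array job writes one
record per task, and this sketch keeps only the last one it sees.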
- Jeff
Brendan MacLean wrote:
Hi,
I have been working with Globus for several months now, and we have
had great success with using it to drive a Torque/PBS scheduled cluster.
Recently I have been trying to get our system working with SGE, where
I have found that GRAM never reports a failure, only successful
completion, even when qacct -j on the job in question clearly reports
a non-zero exit status. After an initial wrong turn of trying to edit
the poll() method in sge.pm, I found that this method is never called.
A Google search then brought me to a previous message on this list:
http://www.mail-archive.com/[email protected]/msg03507.html
There I learned that Globus depends on the
$SGE_ROOT/$SGE_CELL/common/reporting file for its job exit status
information.
We do have SGE generating this file, and we have verified that it
correctly reports non-zero exit status for jobs that GRAM reports as
exiting with zero.
Would someone mind pointing me to the code that reads this file, so I
can understand what format it is expecting?
Any other pointers on how to address this issue would be much appreciated.
Thanks in advance.
--Brendan