Hi Jeff,

I had found the code there (thanks for confirming it is a primary source),
and even located the problem in seg_sge_module.c as being the hard-coded
index 13 for the location of the exit-status, where it is 14 on our system.
Indeed, column 13 is zero in every case I have run, while column 14 holds 1
on failure (or zero on success).  (The ability to configure this index, and
a little documentation for it, would be nice, of course.)
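
As a rough sketch of what a configurable index could look like (this is
not the code in seg_sge_module.c): the helper names and the
SEG_SGE_EXIT_STATUS_FIELD environment variable below are invented for
illustration, and the record is assumed to be a single colon-delimited
line with 0-based field numbering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy field number `index` (0-based) of a
 * delimiter-separated record into `out`.
 * Returns 0 on success, -1 if the field is missing or does not fit. */
static int record_field(const char *record, char delim, int index,
                        char *out, size_t out_len)
{
    assert(record != NULL && out != NULL && out_len > 0);
    if (index < 0)
        return -1;

    const char *start = record;
    for (int i = 0; i < index; i++) {
        start = strchr(start, delim);
        if (start == NULL)
            return -1;              /* record has fewer fields than expected */
        start++;                    /* step past the delimiter */
    }

    char stops[3] = { delim, '\n', '\0' };
    size_t len = strcspn(start, stops);
    if (len >= out_len)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical: take the exit-status column from the environment and fall
 * back to 13, the hard-coded value mentioned above. The variable name
 * SEG_SGE_EXIT_STATUS_FIELD is an assumption, not an existing setting. */
static int exit_status_field_index(void)
{
    const char *env = getenv("SEG_SGE_EXIT_STATUS_FIELD");
    if (env == NULL || *env == '\0')
        return 13;
    return atoi(env);
}

/* Parse the exit status found at column `index` of `record`.
 * Returns 0 and stores the value in *status, or -1 on a malformed record. */
static int exit_status_at(const char *record, int index, int *status)
{
    char field[32];
    char *end;

    if (record_field(record, ':', index, field, sizeof field) != 0)
        return -1;
    long value = strtol(field, &end, 10);
    if (end == field || *end != '\0')
        return -1;                  /* field is empty or not a plain integer */
    *status = (int)value;
    return 0;
}
```

A caller would then use exit_status_at(line, exit_status_field_index(),
&status) in place of a fixed column number, so a site whose exit status
sits in column 14 only has to change a setting.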

Also, for a "job_log::deleted" record, I will want a failure, rather than
the successful completion currently reported.  I believe this works
correctly for PBS.
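
A minimal sketch of the mapping described above, assuming the record
type and the job_log event name have already been extracted from the
reporting line. The enum and function names are hypothetical and do not
come from seg_sge_module.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome values for illustration only. */
enum job_outcome { JOB_UNKNOWN, JOB_DONE, JOB_FAILED };

/* Decide what to report for one reporting-file record.
 *   record_type: e.g. "acct" or "job_log"
 *   event:       the job_log event name (may be NULL for other records)
 *   exit_status: exit status taken from an "acct" record */
static enum job_outcome classify_record(const char *record_type,
                                        const char *event,
                                        int exit_status)
{
    assert(record_type != NULL);
    if (event == NULL)
        event = "";

    /* A deleted job is reported as a failure, not a success. */
    if (strcmp(record_type, "job_log") == 0 &&
        strcmp(event, "deleted") == 0)
        return JOB_FAILED;

    /* An accounting record reports success only for exit status zero. */
    if (strcmp(record_type, "acct") == 0)
        return exit_status == 0 ? JOB_DONE : JOB_FAILED;

    /* Anything else carries no completion information in this sketch. */
    return JOB_UNKNOWN;
}
```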

This may not be a complete solution, but these fixes will be a huge
improvement over what we have.  After that we'll see how many unreported
errors we run into.

We have not yet built Globus on this machine.  At least it looks like my
sysadmin used a binary install.  When I run "configure" for the
globus_scheduler_event_generator_sge-1.1 package, I get a number of errors
about missing dependencies, and finally:

./configure: line 1442: /usr/local/gt/libexec/globus-build-env-.sh: No such
file or directory

Any suggestions on the shortest route to replacing our version of
libglobus_seg_sge_* with one built from an edited seg_sge_module.c would be
much appreciated.  Otherwise, I am sure we can muddle through this today.

Thanks again for the quick response.

--Brendan

On Thu, Jan 22, 2009 at 10:22 PM, Jeff Porter <[email protected]> wrote:

>
> Hi Brendan,
>
> You can view that code in the seg_sge_module.c file of the
> globus_scheduler_event_generator_sge-1.1 package at,
>
> http://www.lesc.ic.ac.uk/projects/SGE-GT4.html
>
> There are variations on this code but I don't know whether anyone has
> touched the logic for decoding the reporting file information. I just looked
> and it is checking the "acct" record in the reporting file for errors.
>  However, this doesn't seem to be very complete as jobs can fail and finish
> without any updated "acct" record being written.
>
> Perhaps others have implemented or considered a more robust way to check for
> errors?  You can get it from the "accounting" file but that isn't being
> parsed in this seg_sge_module.
>
> - Jeff
>
>
> Brendan MacLean wrote:
>
>> Hi,
>>
>> I have been working with Globus for several months now, and we have had
>> great success with using it to drive a Torque/PBS scheduled cluster.
>>
>> Recently I have been trying to get our system working with SGE, where I
>> have found that GRAM never reports a failure, but always successful
>> completion, though a qacct -j on the job in question clearly reports
>> non-zero exit status.  After an initial wrong turn of trying to edit the
>> poll() method in sge.pm, I found that this method is
>> never called, and a Google search brought me to a previous message on this
>> list:
>>
>> http://www.mail-archive.com/[email protected]/msg03507.html
>>
>> Where I found that Globus depends on the
>> $SGE_ROOT/$SGE_CELL/common/reporting file for its job exit status
>> information.
>>
>> We do have SGE generating this, and we have verified that it correctly
>> reports non-zero exit status for things that GRAM reports as zero.
>>
>> Would someone mind pointing me to the code that reads this file, so I can
>> understand what format it is expecting?
>>
>> Any other pointers on how to address this issue would be much appreciated.
>>
>> Thanks in advance.
>>
>> --Brendan
>>
>
