Hi Prakashan,

You're right that changing the SGE code might be easier to maintain, but I never thought of the two-file solution as a good one, just a quick one. I did speak with one of the ARCO developers about changing the dbwriter, but that didn't seem plausible from their end. The other solution that seems more realistic is to have the SEG get this information from different sources via some plugin (e.g. from the reporting file, the arco-db, or something even lighter), depending on some flag in the globus_sge.conf file.

The seg_pbs_module.c version is quite different since pbs has an internal logfile rotation mechanism that the seg understands. When I compare 4.0.8 and 4.2.1 versions of the pbs_module, I only see the one change you've noted.

I do know there is one memory leak in the LeSC version that has been fixed in the vdt version. You might consider making that change. The LeSC version contains:

   result = globus_callback_register_oneshot(
           &logfile_state->callback,
           &delay,
           globus_l_sge_read_callback,
           logfile_state);

However, if the 1st argument isn't null, the function makes a copy of the memory (it may even try to take ownership of the memory, I don't remember right now). You can compare with the pbs version. It occurs twice in the module but the leak is small. Perhaps this causes additional problems in gt4.2?

You can fix your version or grab the vdt version which includes this fix:

http://vdt.cs.wisc.edu/software/sge-jobmanager/1.1-p5-1//src/globus_scheduler_event_generator_sge_1.1.tar.gz

The vdt version also handles 'reporting' file rotation. It does not have the gt4.2 fix you mention here.

Thanks, Jeff


Korambath, Prakashan wrote:

Hi Jeff,

Regarding the Arco/gt4: Isn't it better if someone changes the SGE source code to write an additional file, say seg-reporting or something like that? I can work with you on that, no problem here. If we can get the SGE developers to do that, then the changes will be there in their source code distribution.

For the SEG update issue this is what I did:


I just modified the file from here
http://www.lesc.ic.ac.uk/projects/SGE-GT4.html

globus_scheduler_event_generator_sge-1.1.tar.gz

I saved the contents of someone else's post several weeks ago because I thought it would be useful to me.
For everybody who's interested:
I just had to replace the section

**********************************
globus_module_descriptor_t
globus_scheduler_event_module_ptr =
{
    "globus_scheduler_event_generator_sge",
    globus_l_sge_module_activate,
    globus_l_sge_module_deactivate,
    NULL,
    NULL,
    &local_version,
    NULL
};
*********************************

in the seg_sge_module.c from the globus_scheduler_event_generator_sge-1.1.tar.gz package with the following:

*********************************
GlobusExtensionDefineModule(globus_seg_sge) =
{
    "globus_seg_sge",
     globus_l_sge_module_activate,
     globus_l_sge_module_deactivate,
     NULL,
     NULL,
     &local_version

};
**************************************

Without the above change I was getting the error below.

2008-11-04T08:06:45.415-08:00 ERROR seg.SchedulerEventGenerator [SEG-sge-Thread,run:230] SEG Terminated with globus_scheduler_event_generator: Invalid module sge: activation failed
2008-11-04T08:06:55.450-08:00 ERROR seg.SchedulerEventGenerator [SEG-sge-Thread,run:230] SEG Terminated with globus_scheduler_event_generator: Invalid module sge: activation failed
2008-11-04T08:07:05.504-08:00 INFO impl.DefaultIndexService [ServiceThread-60,performDefaultRegistrations:261] guid=9fceec90-aa8a-11dd-9507-895ddbf3eafc event=org.globus.mds.index.performDefaultRegistrations.end status=0
2008-11-04T08:07:05.505-08:00 ERROR seg.SchedulerEventGenerator [SEG-sge-Thread,run:230] SEG Terminated with globus_scheduler_event_generator: Invalid module sge: activation failed


So I modified the seg_sge_module.c file and re-installed the event generator

gpt-build --force globus_scheduler_event_generator_sge-1.1.tar.gz gcc64dbg

After gpt-postinstall the error went away. I just compared the new seg_pbs_module.c from the GT 4.2 distribution with the seg_sge_module.c from London e-Science and am seeing a lot of differences. Maybe I should rewrite it according to the current seg_pbs_module.c.

Prakashan


-----Original Message-----
From: Jeff Porter [mailto:[EMAIL PROTECTED]
Sent: Thu 11/6/2008 1:48 PM
To: Korambath, Prakashan
Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG with SGE 6.2; job status is always unsubmitted


This is odd. The code appears to be missing the 'delivered' line, but
that doesn't seem reasonable. You say you made some changes to the
seg_sge_module.c file for 4.2 compatibility. Have these changes worked
before, or is this all new investigation?  I'd like to see what you had
to fix. Could you send me your seg_sge_module.c?
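
For anyone checking their own reporting file for the 'delivered' line, here is a minimal sketch in C of pulling the event name and job number out of a job_log record. The field positions (timestamp, record type, event time, event, job number) are inferred only from the sample reporting output quoted in this thread, not from the SGE documentation, so treat them as an assumption. parse_job_log and job_log_record are hypothetical names for illustration; this is not the SEG's own globus_l_sge_split_into_fields().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for this sketch; not part of the SEG source. */
typedef struct {
    long timestamp;   /* field 0: time the record was written */
    char event[32];   /* field 3: e.g. pending, sent, delivered, finished */
    long job_id;      /* field 4: SGE job number */
} job_log_record;

/*
 * Parse one line of the reporting file.
 * Returns 1 and fills *out if the line is a job_log record, 0 otherwise.
 * Field layout is assumed from the sample output in this thread.
 */
int parse_job_log(const char *line, job_log_record *out)
{
    const char *field[5];
    const char *p = line;
    size_t len;
    int n;

    assert(line != NULL && out != NULL);

    /* Record where each of the first five colon-separated fields starts.
     * strchr (unlike strtok) does not skip empty fields. */
    for (n = 0; n < 5; n++) {
        field[n] = p;
        p = strchr(p, ':');
        if (p == NULL)
            return 0;           /* fewer than five complete fields */
        p++;
    }

    /* Only job_log records carry the state transitions of interest. */
    if (strncmp(field[1], "job_log:", 8) != 0)
        return 0;

    /* The event name runs from field[3] up to the colon before field[4]. */
    len = (size_t)(field[4] - field[3]) - 1;
    if (len == 0 || len >= sizeof out->event)
        return 0;
    memcpy(out->event, field[3], len);
    out->event[len] = '\0';

    out->timestamp = strtol(field[0], NULL, 10);
    out->job_id = strtol(field[4], NULL, 10);
    return 1;
}
```

Feeding each line of the reporting file through this and printing the event for one job id shows at a glance whether pending, sent, delivered, finished and deleted all appear.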

As for the gt4/ARCO mismatch - I've wanted to find/develop a solution
for this problem for a while but haven't been able to devote any time to
it.  One simple solution would be to have a small script/daemon read the
sge reporting file and create a second file that is read by the
dbwriter.  That way the original reporting file is maintained.   Would
you like to collaborate on putting together/testing something like that?
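
To make the idea concrete, here is a rough sketch in C of the copying step such a daemon might use. It is only an illustration under assumptions: mirror_appended is a hypothetical helper, the two paths are whatever the site chooses, and how the dbwriter would be pointed at the second file is not covered. It appends whatever has been added to the source since a remembered byte offset, and re-creates the destination if it has been deleted.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of SGE, ARCO, or the SEG.
 * Appends the bytes of src_path, from byte offset `from` to its current
 * end, onto dst_path (created if it does not exist).
 * Returns the offset to pass in on the next call, or -1 on error.
 */
long mirror_appended(const char *src_path, const char *dst_path, long from)
{
    char buf[4096];
    size_t n;
    FILE *src;
    FILE *dst;
    int failed = 0;

    assert(src_path != NULL && dst_path != NULL && from >= 0);

    src = fopen(src_path, "rb");
    if (src == NULL)
        return -1;
    if (fseek(src, from, SEEK_SET) != 0) {
        fclose(src);
        return -1;
    }

    /* Append mode re-creates the copy if its reader has deleted it. */
    dst = fopen(dst_path, "ab");
    if (dst == NULL) {
        fclose(src);
        return -1;
    }

    /* Copy everything written to the source since the last call. */
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        if (fwrite(buf, 1, n, dst) != n) {
            failed = 1;
            break;
        }
        from += (long) n;
    }
    if (ferror(src))
        failed = 1;

    fclose(src);
    if (fclose(dst) != 0)
        failed = 1;

    return failed ? -1 : from;
}
```

A daemon would call this in a loop with a short sleep, keeping the returned offset between calls. The sketch does not handle the source file itself being truncated or replaced; in that case the offset would have to be reset to 0.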

Thanks, Jeff

Korambath, Prakashan wrote:
>
> Hi Jeff,
>
> The reporting file looks ok to me.  I just submitted one job and below
> is the output.  Do we have another alternative for reporting file if
> someone is running Arco's dbwriter?
>
> Prakashan
>
>
> 1226006078:new_job:1226006078:29:-1:NONE:sge_job_script.20845:ppk:staff::defaultdepartment:sge:1024
> 1226006078:job_log:1226006078:pending:29:-1:NONE::ppk:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:new
> job
> 1226006081:job_log:1226006081:sent:29:0:NONE:t:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:sent
> to execd
> 1226006081:job_log:1226006081:delivered:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> received by execd
> 1226006092:acct:all.q:grid4.ats.ucla.edu:staff:ppk:sge_job_script.20845:29:sge:0:1226006078:1226006081:1226006091:0:0:10:0.111982:0.059990:0.000000:0:0:0:0:18747:0:0:0.000000:0:0:0:0:219:85:NONE:defaultdepartment:NONE:1:0:0.171972:0.000000:0.000000:NONE:0.000000:NONE:127770624.000000:0:0
> 1226006092:job_log:1226006092:finished:29:0:NONE:r:execution
> daemon:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> exited
> 1226006092:job_log:1226006092:finished:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> waits for schedds deletion
> 1226006093:host:grid4.ats.ucla.edu:1226006093:X:cpu=1.200000,np_load_avg=0.150000,mem_free=7214.328125M,virtual_free=15215.441406M
> 1226006096:job_log:1226006096:deleted:29:0:NONE:T:scheduler:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> deleted by schedd
>
>
>
>
> -----Original Message-----
> From: Jeff Porter [mailto:[EMAIL PROTECTED]
> Sent: Thu 11/6/2008 1:12 PM
> To: Korambath, Prakashan
> Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> with SGE  6.2; job status is always unsubmitted
>
> Hi Prakashan,
>
> When you run your test with the SEG_SGE_DEBUG level set, what
> corresponding entries do you see in the reporting file? Either 'tail -f'
> the file and/or grep on "job_log" and the job id.
>
> BTW: ARCO's dbwriter does delete the reporting file as its checkpoint
> mechanism, so that's still an incompatibility with gt4.
>
> thanks, Jeff
>
> Korambath, Prakashan wrote:
> >
> > Hi,
> >
> >   I am trying to sort out some issues with integrating Globus Toolkit
> > 4.2 and SGE 6.2 SEG.  Some of the issues have already been answered on
> > the mailing list and I have followed those answers and they work
> > correctly, but I am having at least a couple of issues.
> >
> > For example command below
> >
> > 1. globusrun-ws -debug -batch -submit -o job_epr -factory
> > "globushostname" -Ft SGE -f sleep.xml
> > submits and runs the job ok, but command below
> >
> >
> > 2. globusrun-ws -debug -status -job-epr-file job_epr
> >
> > This command always returns status unsubmitted even when the job is long
> > gone.
> >
> > Current job state: Unsubmitted
> >
> > I checked the $SGE_ROOT/$SGE_CELL/common/reporting file.
> > I found this file disappearing when SGE's ARCO dbwriter is also
> > running.  For testing purposes I stopped postgresql and stopped
> > ARCO from doing anything to that file. So now that file is there, but
> > the SEG is still not getting updates like pending, finished etc.
> > Everything is fine with Fork, so there is some problem with SGE-SEG.
> >
> > I also set
> >
> > export SEG_SGE_DEBUG=3 and ran
> > /home/globus/gt4.2.1/libexec/globus-scheduler-event-generator -s sge
> > -t 1225815907
> >
> >
> > globus_l_sge_split_into_fields()
> > globus_l_sge_split_into_fields(): exit success
> > New event: job 28 now pending
> > freeing fields
> > globus_l_sge_parse_events() exits
> > globus_l_sge_clean_buffer() called
> > globus_l_sge_split_into_fields()
> > globus_l_sge_split_into_fields(): exit success
> > New event: job 28 now completed
> > freeing fields
> > globus_l_sge_split_into_fields()
> > globus_l_sge_split_into_fields(): exit success
> >
> >
> > So the scheduler event generator seems to get the status.  My
> > suspicion is that something is missing in the file seg_sge_module.c.
> > I already have changes mentioned here
> >
> > http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram4/developer/scheduler-tutorial-seg.html
> >
> > I wonder what else is missing.
> >
> >
> > Prakashan
> >
> >
> >
>

