Hi Jeff,
This is the suggestion from Richard on the SGE mailing list for the
Arco/dbwriter problem. I suppose we can always tell the
$GLOBUS_LOCATION/etc/globus-sge.conf file to look at any log_path we want.
Prakashan
Set up a cron job which duplicates the reporting file:
# dbwriter deletes reporting_for_arco once it has processed it, so its
# absence means the previous batch is done. reporting_for_globus is only
# ever appended to, so it cannot be part of this test.
if [ -r "$SGE_ROOT/$SGE_CELL/common/reporting" ] &&
   [ ! -r "$SGE_ROOT/$SGE_CELL/common/reporting_for_arco" ]; then
    # move the current reporting file into a tmp file
    # qmaster will recreate the reporting file soon
    mv "$SGE_ROOT/$SGE_CELL/common/reporting" \
       "$SGE_ROOT/$SGE_CELL/common/reporting.tmp"
    # Append the tmp file to the reporting file for globus
    cat "$SGE_ROOT/$SGE_CELL/common/reporting.tmp" \
        >> "$SGE_ROOT/$SGE_CELL/common/reporting_for_globus"
    # Rename the tmp file to reporting_for_arco; dbwriter will process it
    mv "$SGE_ROOT/$SGE_CELL/common/reporting.tmp" \
       "$SGE_ROOT/$SGE_CELL/common/reporting_for_arco"
fi
In dbwriter.conf (the configuration file of dbwriter) the path to the reporting
file is defined:
% cat $SGE_ROOT/$SGE_CELL/common/dbwriter.conf
...
#
# File name of reporting file
#
DBWRITER_REPORTING_FILE=$SGE_ROOT/$SGE_CELL/common/reporting
...
I hope that the path to the reporting file is not hard coded in globus.
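If it is configurable (the note at the top of this mail suggests the log_path
entry in $GLOBUS_LOCATION/etc/globus-sge.conf), the two consumers could each be
pointed at their own copy, roughly as below. This is only a sketch: the file
names are the ones used in the cron script, the /opt/sge/default prefix is a
placeholder, and the exact syntax of globus-sge.conf is an assumption.

```shell
# dbwriter.conf: let dbwriter consume (and delete) its private copy
DBWRITER_REPORTING_FILE=$SGE_ROOT/$SGE_CELL/common/reporting_for_arco

# $GLOBUS_LOCATION/etc/globus-sge.conf: point the SEG at the other copy.
# Placeholder path, written out in full on the assumption that the SEG
# does not expand shell variables in this file.
log_path=/opt/sge/default/common/reporting_for_globus
```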
I used such a script already for testing different database systems. The
reporting file of one cluster was processed by two dbwriter instances. One was
writing into a postgres database, the other into a mysql database.
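A script for that two-dbwriter case might look roughly like the sketch below.
It is an illustration, not the script that was actually used: the function name
fan_out_reporting and the reporting_for_NAME file names are made up, and it
assumes that every consumer deletes its input file once it has processed it.

```shell
#!/bin/sh
# Hypothetical helper: hand the current reporting file to several consumers.
# Usage: fan_out_reporting DIR NAME...
# e.g.   fan_out_reporting "$SGE_ROOT/$SGE_CELL/common" postgres mysql
fan_out_reporting() {
    dir=$1
    shift

    # Nothing to do until qmaster has written a reporting file.
    [ -r "$dir/reporting" ] || return 0

    # Wait until every consumer has removed its previous copy.
    for name in "$@"; do
        if [ -e "$dir/reporting_for_$name" ]; then
            return 0
        fi
    done

    # Take the file away from qmaster, which will recreate it.
    mv "$dir/reporting" "$dir/reporting.tmp" || return 1

    # Copy under a temporary name and then rename, so that a consumer
    # never sees a half-written file.
    for name in "$@"; do
        cp "$dir/reporting.tmp" "$dir/reporting_for_$name.new" || return 1
        mv "$dir/reporting_for_$name.new" "$dir/reporting_for_$name" || return 1
    done
    rm -f "$dir/reporting.tmp"
}
```

Called from cron every minute or so, this hands out a new batch only after all
consumers have finished the previous one. It makes no attempt to recover from
a copy that fails half way, so it is a starting point rather than something to
run unchanged on a production cluster.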
Richard
-----Original Message-----
From: Jeff Porter [mailto:[EMAIL PROTECTED]
Sent: Thu 11/6/2008 3:22 PM
To: Korambath, Prakashan
Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG with SGE
6.2; job status is always unsubmitted
Hi Prakashan,
You're right that changing the SGE code might be easier to maintain but
I never thought of the 2 file solutions as a good one - just a quick
one. I did speak with one of the ARCO developers about changing the
dbwriter but that didn't seem plausible from their end. The other
solution that seems more realistic is to have the SEG be able to get
this information from different sources via some plugin - e.g. from the
reporting file, arco-db, something even lighter - depending on some flag
in the globus-sge.conf file.
The seg_pbs_module.c version is quite different since pbs has an
internal logfile rotation mechanism that the seg understands. When I
compare 4.0.8 and 4.2.1 versions of the pbs_module, I only see the one
change you've noted.
I do know there is one memory leak in the LeSC version that has been
fixed in the vdt version. You might consider making that change. The LeSC
version contains
result = globus_callback_register_oneshot(
&logfile_state->callback,
&delay,
globus_l_sge_read_callback,
logfile_state);
However, if the 1st argument isn't null, the function makes a copy of
the memory (it may even try to take ownership of the memory, I don't
remember right now). You can compare with the pbs version. It occurs
twice in the module but the leak is small. Perhaps this causes
additional problems in gt4.2?
You can fix your version or grab the vdt version which includes this fix:
http://vdt.cs.wisc.edu/software/sge-jobmanager/1.1-p5-1//src/globus_scheduler_event_generator_sge_1.1.tar.gz
The vdt version also handles 'reporting' file rotation. It does not have
the gt4.2 fix you mention here.
Thanks, Jeff
Korambath, Prakashan wrote:
>
> Hi Jeff,
>
> Regarding the Arco/gt4: Isn't it better if someone changes the SGE
> source code to write an additional file, say seg-reporting or
> something like that? I can work with you on that, no problem here. If
> we can get the SGE developers to do that, then the changes will be there
> in their source code distribution.
>
> For the SEG update issue this is what I did:
>
>
> I just modified the file from here
> http://www.lesc.ic.ac.uk/projects/SGE-GT4.html
>
> globus_scheduler_event_generator_sge-1.1.tar.gz
>
> I saved the contents of someone else's post several weeks ago because
> I thought it would be useful to me.
>
> For everybody who's interested:
> I just had to replace the section
>
> **********************************
> globus_module_descriptor_t
> globus_scheduler_event_module_ptr =
> {
> "globus_scheduler_event_generator_sge",
> globus_l_sge_module_activate,
> globus_l_sge_module_deactivate,
> NULL,
> NULL,
> &local_version,
> NULL
> };
> *********************************
>
> in the seg_sge_module.c from the
> globus_scheduler_event_generator_sge-1.1.tar.gz package with the
> following:
>
> *********************************
> GlobusExtensionDefineModule(globus_seg_sge) =
> {
> "globus_seg_sge",
> globus_l_sge_module_activate,
> globus_l_sge_module_deactivate,
> NULL,
> NULL,
> &local_version
>
> };
> **************************************
>
> Without the above change I was getting the error below.
>
> 2008-11-04T08:06:45.415-08:00 ERROR seg.SchedulerEventGenerator
> [SEG-sge-Thread,run:230] SEG Terminated with
> globus_scheduler_event_generator: Invalid module sge: activation failed
> 2008-11-04T08:06:55.450-08:00 ERROR seg.SchedulerEventGenerator
> [SEG-sge-Thread,run:230] SEG Terminated with
> globus_scheduler_event_generator: Invalid module sge: activation failed
> 2008-11-04T08:07:05.504-08:00 INFO impl.DefaultIndexService
> [ServiceThread-60,performDefaultRegistrations:261]
> guid=9fceec90-aa8a-11dd-9507-895ddbf3eafc
> event=org.globus.mds.index.performDefaultRegistrations.end status=0
> 2008-11-04T08:07:05.505-08:00 ERROR seg.SchedulerEventGenerator
> [SEG-sge-Thread,run:230] SEG Terminated with
> globus_scheduler_event_generator: Invalid module sge: activation failed
>
>
> So I modified the seg_sge_module.c file and re-installed the event
> generator
>
> gpt-build --force globus_scheduler_event_generator_sge-1.1.tar.gz gcc64dbg
>
> After gpt-postinstall the error went away. I just compared the new
> seg_pbs_module.c from the GT 4.2 distribution with the seg_sge_module.c
> from London e-Science and am seeing a lot of differences. Maybe I
> should rewrite it according to the current seg_pbs_module.c.
>
> Prakashan
>
>
> -----Original Message-----
> From: Jeff Porter [mailto:[EMAIL PROTECTED]
> Sent: Thu 11/6/2008 1:48 PM
> To: Korambath, Prakashan
> Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> with SGE 6.2; job status is always unsubmitted
>
>
> This is odd. The code appears to be missing the 'delivered' line, but
> that doesn't seem reasonable. You say you made some changes to the
> seg_sge_module.c file for 4.2 compatibility. Have these changes worked
> before or is this all new investigation? I'd like to see what you had
> to fix. Could you send me your seg_sge_module.c?
>
> As for the gt4/ARCO mismatch - I've wanted to find/develop a solution
> for this problem for a while but haven't been able to devote any time to
> it. One simple solution would be to have a small script/daemon read the
> sge reporting file and create a second file that is read by the
> dbwriter. That way the original reporting file is maintained. Would
> you like to collaborate on putting together/testing something like that?
>
> Thanks, Jeff
>
> Korambath, Prakashan wrote:
> >
> > Hi Jeff,
> >
> > The reporting file looks ok to me. I just submitted one job and below
> > is the output. Do we have an alternative to the reporting file if
> > someone is running Arco's dbwriter?
> >
> > Prakashan
> >
> >
> >
> > 1226006078:new_job:1226006078:29:-1:NONE:sge_job_script.20845:ppk:staff::defaultdepartment:sge:1024
> > 1226006078:job_log:1226006078:pending:29:-1:NONE::ppk:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:new job
> > 1226006081:job_log:1226006081:sent:29:0:NONE:t:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:sent to execd
> > 1226006081:job_log:1226006081:delivered:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job received by execd
> > 1226006092:acct:all.q:grid4.ats.ucla.edu:staff:ppk:sge_job_script.20845:29:sge:0:1226006078:1226006081:1226006091:0:0:10:0.111982:0.059990:0.000000:0:0:0:0:18747:0:0:0.000000:0:0:0:0:219:85:NONE:defaultdepartment:NONE:1:0:0.171972:0.000000:0.000000:NONE:0.000000:NONE:127770624.000000:0:0
> > 1226006092:job_log:1226006092:finished:29:0:NONE:r:execution daemon:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job exited
> > 1226006092:job_log:1226006092:finished:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job waits for schedds deletion
> > 1226006093:host:grid4.ats.ucla.edu:1226006093:X:cpu=1.200000,np_load_avg=0.150000,mem_free=7214.328125M,virtual_free=15215.441406M
> > 1226006096:job_log:1226006096:deleted:29:0:NONE:T:scheduler:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job deleted by schedd
> >
> >
> >
> >
> > -----Original Message-----
> > From: Jeff Porter [mailto:[EMAIL PROTECTED]
> > Sent: Thu 11/6/2008 1:12 PM
> > To: Korambath, Prakashan
> > Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> > Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> > with SGE 6.2; job status is always unsubmitted
> >
> > Hi Prakashan,
> >
> > When you run your test with the SEG_SGE_DEBUG level set, what
> > corresponding entries do you see in the reporting file? Either 'tail -f'
> > the file and/or grep for "job_log" and the job id.
> >
> > BTW: ARCO's dbwriter does delete the reporting file as its checkpoint
> > mechanism, so that's still an incompatibility with gt4.
> >
> > thanks, Jeff
> >
> > Korambath, Prakashan wrote:
> > >
> > > Hi,
> > >
> > > I am trying to sort out some issues with integrating Globus Toolkit
> > > 4.2 and the SGE 6.2 SEG. Some of the issues have already been answered
> > > on the mailing list and I have followed those answers and they work
> > > correctly, but I still have at least a couple of issues.
> > >
> > > For example, the command below
> > >
> > > 1. globusrun-ws -debug -batch -submit -o job_epr -factory
> > > "globushostname" -Ft SGE -f sleep.xml
> > > submits and runs the job ok, but the command below
> > >
> > >
> > > 2. globusrun-ws -debug -status -job-epr-file job_epr
> > >
> > > always returns the status Unsubmitted, even when the job is long
> > > gone.
> > >
> > > Current job state: Unsubmitted
> > >
> > > I checked the $SGE_ROOT/$SGE_CELL/common/reporting file.
> > > I found this file disappearing when SGE's ARCO dbwriter is also
> > > running. For testing purposes I stopped postgresql and stopped
> > > ARCO from doing anything to that file. So now that file is there, but
> > > the SEG is still not getting updates like pending, finished etc.
> > > Everything is fine with Fork, so there is some problem with SGE-SEG.
> > >
> > > I also set
> > >
> > > export SEG_SGE_DEBUG=3 and ran
> > > /home/globus/gt4.2.1/libexec/globus-scheduler-event-generator -s sge
> > > -t 1225815907
> > >
> > >
> > > globus_l_sge_split_into_fields()
> > > globus_l_sge_split_into_fields(): exit success
> > > New event: job 28 now pending
> > > freeing fields
> > > globus_l_sge_parse_events() exits
> > > globus_l_sge_clean_buffer() called
> > > globus_l_sge_split_into_fields()
> > > globus_l_sge_split_into_fields(): exit success
> > > New event: job 28 now completed
> > > freeing fields
> > > globus_l_sge_split_into_fields()
> > > globus_l_sge_split_into_fields(): exit success
> > >
> > >
> > > So the scheduler event generator seems to get the status. My
> > > suspicion is that something is missing in the file seg_sge_module.c.
> > > I have already made the changes mentioned here
> > >
> > > http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram4/developer/scheduler-tutorial-seg.html
> > >
> > > I wonder what else is missing.
> > >
> > >
> > > Prakashan
> > >
> > >
> > >
> >
>
>