Hi Jeff,
You were right about the memory issue. I replaced the first argument in
both calls to globus_callback_register_oneshot with NULL, as you suggested,
and the SEG seems to be working OK now: job states are reported as Active
and Done. Thanks again for your help.
I replaced

>     result = globus_callback_register_oneshot(
>         &logfile_state->callback,
>         &delay,
>         globus_l_sge_read_callback,
>         logfile_state);

with

    result = globus_callback_register_oneshot(
        NULL,
        &delay,
        globus_l_sge_read_callback,
        logfile_state);
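
For anyone who would rather script the change than edit by hand, here is a
minimal sketch using sed. It assumes that both call sites pass the handle on
a line of its own, exactly as quoted above. The function name
patch_oneshot_handle is made up for illustration, and the original file is
kept as a .orig backup.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Globus package): replace the
# "&logfile_state->callback," argument line with "NULL," in the given file.
# Assumes the argument sits on a line of its own, as in the snippet above.
patch_oneshot_handle() {
    src=$1
    # Keep a backup so the change can be reviewed or reverted.
    cp "$src" "$src.orig" || return 1
    # Preserve the leading indentation (\1) and swap only the argument.
    sed 's/^\([[:space:]]*\)&logfile_state->callback,[[:space:]]*$/\1NULL,/' \
        "$src.orig" > "$src"
}
```

Running patch_oneshot_handle seg_sge_module.c and then diffing against
seg_sge_module.c.orig should show only the two argument lines changed; if it
shows anything else, the assumption about the call sites does not hold for
that copy of the source.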
Prakashan
-----Original Message-----
From: Jeff Porter [mailto:[EMAIL PROTECTED]
Sent: Fri 11/7/2008 10:18 AM
To: Korambath, Prakashan
Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG with SGE
6.2; job status is always unsubmitted
Hi Prakashan,
Thanks for passing this on. As I said, I don't think it's a great
solution (particularly on systems where the reporting file can get big),
but it is a simple short-term one.
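
For reference, the cron workaround quoted below could be wrapped in a small
function along these lines. This is only a sketch: the function name and the
directory argument are made up so that it can be tried outside a live cell,
and it checks only reporting_for_arco, on the assumption that dbwriter
removes that file once it has been processed while the Globus copy keeps
growing.

```shell
#!/bin/sh
# Hypothetical wrapper around the quoted cron snippet. "dir" stands for
# $SGE_ROOT/$SGE_CELL/common; pass a scratch directory when experimenting.
split_reporting() {
    dir=$1
    # Assumption: dbwriter deletes reporting_for_arco after processing it,
    # so its presence means the previous batch is still pending.
    [ -e "$dir/reporting_for_arco" ] && return 0
    # Nothing to do until qmaster has written a reporting file.
    [ -r "$dir/reporting" ] || return 0
    # Move the live file aside; qmaster is expected to recreate it.
    mv "$dir/reporting" "$dir/reporting.tmp" || return 1
    # Append the batch to the copy read by the SEG.
    cat "$dir/reporting.tmp" >> "$dir/reporting_for_globus" || return 1
    # Hand the same batch to dbwriter.
    mv "$dir/reporting.tmp" "$dir/reporting_for_arco"
}
```

A cron entry would then call split_reporting "$SGE_ROOT/$SGE_CELL/common",
with log_path in globus-sge.conf pointing at reporting_for_globus and
DBWRITER_REPORTING_FILE pointing at reporting_for_arco. The Globus copy still
grows without bound, so this does not remove the concern about large files.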
BTW: I noticed I had a typo in the VDT link I sent (an "_" instead of a
"-"). The correct link is
http://vdt.cs.wisc.edu/software/sge-jobmanager/1.1-p5-1//src/globus_scheduler_event_generator_sge-1.1.tar.gz
thanks again,
Jeff
Korambath, Prakashan wrote:
>
> Hi Jeff,
>
> This is the suggestion from Richard on the SGE mailing list for the
> Arco/dbwriter problem. I suppose we can always set log_path in the
> $GLOBUS_LOCATION/etc/globus-sge.conf file to point at any file we want.
>
> Prakashan
>
>
>
>
> Set up a cron job that duplicates the reporting file:
>
>
> if [ ! -r "$SGE_ROOT/$SGE_CELL/common/reporting_for_arco" -a \
>      ! -r "$SGE_ROOT/$SGE_CELL/common/reporting_for_globus" ]; then
>
>     # Move the current reporting file to a tmp file;
>     # qmaster will recreate the reporting file soon
>     mv "$SGE_ROOT/$SGE_CELL/common/reporting" \
>        "$SGE_ROOT/$SGE_CELL/common/reporting.tmp"
>
>     # Append the tmp file to the reporting file for globus
>     cat "$SGE_ROOT/$SGE_CELL/common/reporting.tmp" \
>         >> "$SGE_ROOT/$SGE_CELL/common/reporting_for_globus"
>
>     # Rename the tmp reporting file; dbwriter will process it
>     mv "$SGE_ROOT/$SGE_CELL/common/reporting.tmp" \
>        "$SGE_ROOT/$SGE_CELL/common/reporting_for_arco"
>
> fi
>
>
> The path to the reporting file is defined in dbwriter.conf (the
> configuration file of dbwriter):
>
> % cat $SGE_ROOT/$SGE_CELL/common/dbwriter.conf
> ...
> #
> # File name of reporting file
> #
> DBWRITER_REPORTING_FILE=$SGE_ROOT/$SGE_CELL/common/reporting
> ...
>
> I hope that the path to the reporting file is not hard coded in globus.
>
> I have already used such a script for testing different database
> systems. The reporting file of one cluster was processed by two dbwriter
> instances: one was writing into a postgres database, the other into a
> mysql database.
>
> Richard
>
>
> -----Original Message-----
> From: Jeff Porter [mailto:[EMAIL PROTECTED]
> Sent: Thu 11/6/2008 3:22 PM
> To: Korambath, Prakashan
> Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> with SGE 6.2; job status is always unsubmitted
>
>
> Hi Prakashan,
>
> You're right that changing the SGE code might be easier to maintain, but
> I never thought of the two-file solution as a good one, just a quick
> one. I did speak with one of the ARCO developers about changing the
> dbwriter, but that didn't seem plausible from their end. The other
> solution that seems more realistic is to have the SEG get this
> information from different sources via some plugin (e.g. from the
> reporting file, the arco-db, or something even lighter), depending on
> some flag in the globus-sge.conf file.
>
> The seg_pbs_module.c version is quite different since pbs has an
> internal logfile rotation mechanism that the seg understands. When I
> compare 4.0.8 and 4.2.1 versions of the pbs_module, I only see the one
> change you've noted.
>
> I do know there is one memory leak in the LeSC version that has been
> fixed in the vdt version. You might try making that change. The LeSC
> version contains
>
> result = globus_callback_register_oneshot(
> &logfile_state->callback,
> &delay,
> globus_l_sge_read_callback,
> logfile_state);
>
> However, if the 1st argument isn't null, the function makes a copy of
> the memory (it may even try to take ownership of the memory, I don't
> remember right now). You can compare with the pbs version. It occurs
> twice in the module but the leak is small. Perhaps this causes
> additional problems in gt4.2?
>
> You can fix your version or grab the vdt version which includes this fix:
>
> http://vdt.cs.wisc.edu/software/sge-jobmanager/1.1-p5-1//src/globus_scheduler_event_generator_sge_1.1.tar.gz
>
> The vdt version also handles 'reporting' file rotation. It does not have
> the gt4.2 fix you mention here.
>
> Thanks, Jeff
>
>
> Korambath, Prakashan wrote:
> >
> > Hi Jeff,
> >
> > Regarding the Arco/gt4 issue: wouldn't it be better if someone changed
> > the SGE source code to write an additional file, say seg-reporting or
> > something like that? I can work with you on that, no problem. If we
> > can get the SGE developers to do that, then the change will be in
> > their source code distribution.
> >
> > For the SEG update issue this is what I did:
> >
> >
> > I just modified the file from here
> > http://www.lesc.ic.ac.uk/projects/SGE-GT4.html
> >
> > globus_scheduler_event_generator_sge-1.1.tar.gz
> >
> > I saved the contents of someone else's post several weeks ago because
> > I thought it would be useful to me.
> >
> > For everybody who's interested:
> > I just had to replace the section
> >
> > **********************************
> > globus_module_descriptor_t
> > globus_scheduler_event_module_ptr =
> > {
> > "globus_scheduler_event_generator_sge",
> > globus_l_sge_module_activate,
> > globus_l_sge_module_deactivate,
> > NULL,
> > NULL,
> > &local_version,
> > NULL
> > };
> > *********************************
> >
> > in the seg_sge_module.c from the
> > globus_scheduler_event_generator_sge-1.1.tar.gz package with the
> > following:
> >
> > *********************************
> > GlobusExtensionDefineModule(globus_seg_sge) =
> > {
> > "globus_seg_sge",
> > globus_l_sge_module_activate,
> > globus_l_sge_module_deactivate,
> > NULL,
> > NULL,
> > &local_version
> >
> > };
> > **************************************
> >
> > Without the above change I was getting the error below.
> >
> > 2008-11-04T08:06:45.415-08:00 ERROR seg.SchedulerEventGenerator
> > [SEG-sge-Thread,run:230] SEG Terminated with
> > globus_scheduler_event_generator: Invalid module sge: activation failed
> > 2008-11-04T08:06:55.450-08:00 ERROR seg.SchedulerEventGenerator
> > [SEG-sge-Thread,run:230] SEG Terminated with
> > globus_scheduler_event_generator: Invalid module sge: activation failed
> > 2008-11-04T08:07:05.504-08:00 INFO impl.DefaultIndexService
> > [ServiceThread-60,performDefaultRegistrations:261]
> > guid=9fceec90-aa8a-11dd-9507-895ddbf3eafc
> > event=org.globus.mds.index.performDefaultRegistrations.end status=0
> > 2008-11-04T08:07:05.505-08:00 ERROR seg.SchedulerEventGenerator
> > [SEG-sge-Thread,run:230] SEG Terminated with
> > globus_scheduler_event_generator: Invalid module sge: activation failed
> >
> >
> > So I modified the seg_sge_module.c file and re-installed the event
> > generator
> >
> > gpt-build --force globus_scheduler_event_generator_sge-1.1.tar.gz gcc64dbg
> >
> > After gpt-postinstall the error went away. I just compared the new
> > seg_pbs_module.c from the GT 4.2 distribution with the seg_sge_module.c
> > from London e-Science and I am seeing a lot of differences. Maybe I
> > should rewrite it along the lines of the current seg_pbs_module.c.
> >
> > Prakashan
> >
> >
> > -----Original Message-----
> > From: Jeff Porter [mailto:[EMAIL PROTECTED]
> > Sent: Thu 11/6/2008 1:48 PM
> > To: Korambath, Prakashan
> > Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> > Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> > with SGE 6.2; job status is always unsubmitted
> >
> >
> > This is odd. The code appears to be missing the 'delivered' line, but
> > that doesn't seem reasonable. You say you made some changes to the
> > seg_sge_module.c file for 4.2 compatibility. Have these changes worked
> > before, or is this all new investigation? I'd like to see what you had
> > to fix. Could you send me your seg_sge_module.c?
> >
> > As for the gt4/ARCO mismatch: I've wanted to find or develop a solution
> > for this problem for a while but haven't been able to devote any time to
> > it. One simple solution would be to have a small script/daemon read the
> > sge reporting file and create a second file that is read by the
> > dbwriter. That way the original reporting file is maintained. Would
> > you like to collaborate on putting together and testing something like
> > that?
> >
> > Thanks, Jeff
> >
> > Korambath, Prakashan wrote:
> > >
> > > Hi Jeff,
> > >
> > > The reporting file looks OK to me. I just submitted one job and the
> > > output is below. Is there an alternative to the reporting file if
> > > someone is running Arco's dbwriter?
> > >
> > > Prakashan
> > >
> > >
> > >
> >
> 1226006078:new_job:1226006078:29:-1:NONE:sge_job_script.20845:ppk:staff::defaultdepartment:sge:1024
> > >
> >
> 1226006078:job_log:1226006078:pending:29:-1:NONE::ppk:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:new
> > > job
> > >
> >
> 1226006081:job_log:1226006081:sent:29:0:NONE:t:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:sent
> > > to execd
> > >
> >
> 1226006081:job_log:1226006081:delivered:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> > > received by execd
> > >
> >
> 1226006092:acct:all.q:grid4.ats.ucla.edu:staff:ppk:sge_job_script.20845:29:sge:0:1226006078:1226006081:1226006091:0:0:10:0.111982:0.059990:0.000000:0:0:0:0:18747:0:0:0.000000:0:0:0:0:219:85:NONE:defaultdepartment:NONE:1:0:0.171972:0.000000:0.000000:NONE:0.000000:NONE:127770624.000000:0:0
> > > 1226006092:job_log:1226006092:finished:29:0:NONE:r:execution
> > >
> >
> daemon:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> > > exited
> > >
> >
> 1226006092:job_log:1226006092:finished:29:0:NONE:r:master:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> > > waits for schedds deletion
> > >
> >
> 1226006093:host:grid4.ats.ucla.edu:1226006093:X:cpu=1.200000,np_load_avg=0.150000,mem_free=7214.328125M,virtual_free=15215.441406M
> > >
> >
> 1226006096:job_log:1226006096:deleted:29:0:NONE:T:scheduler:grid4.ats.ucla.edu:0:1024:1226006078:sge_job_script.20845:ppk:staff::defaultdepartment:sge:job
> > > deleted by schedd
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Jeff Porter [mailto:[EMAIL PROTECTED]
> > > Sent: Thu 11/6/2008 1:12 PM
> > > To: Korambath, Prakashan
> > > Cc: [EMAIL PROTECTED]; Jin, Kejian; [EMAIL PROTECTED]
> > > Subject: Re: [gt-user] Issues with Globus Tookit 4.2 GRAM and SGE-SEG
> > > with SGE 6.2; job status is always unsubmitted
> > >
> > > Hi Prakashan,
> > >
> > > > When you run your test with the SEG_SGE_DEBUG level set, what
> > > > corresponding entries do you see in the reporting file? Either
> > > > 'tail -f' the file or grep for "job_log" and the job id.
> > > >
> > > > BTW: ARCO's dbwriter does delete the reporting file as its checkpoint
> > > > mechanism, so that's still an incompatibility with gt4.
> > >
> > > thanks, Jeff
> > >
> > > Korambath, Prakashan wrote:
> > > >
> > > > Hi,
> > > >
> > > > > I am trying to sort out some issues with integrating Globus Toolkit
> > > > > 4.2 and the SGE 6.2 SEG. Some of the issues have already been
> > > > > answered on the mailing list, and I have followed those answers and
> > > > > they work correctly, but I still have at least a couple of issues.
> > > >
> > > > > For example, the command below
> > > > >
> > > > > 1. globusrun-ws -debug -batch -submit -o job_epr -factory
> > > > > "globushostname" -Ft SGE -f sleep.xml
> > > > >
> > > > > submits and runs the job OK, but the command below
> > > > >
> > > > > 2. globusrun-ws -debug -status -job-epr-file job_epr
> > > > >
> > > > > always returns the status Unsubmitted, even when the job is long
> > > > > gone:
> > > >
> > > > Current job state: Unsubmitted
> > > >
> > > > > I checked the $SGE_ROOT/$SGE_CELL/common/reporting file. I found
> > > > > this file disappearing when SGE's ARCO dbwriter is also running.
> > > > > For testing purposes I stopped postgresql and stopped ARCO from
> > > > > doing anything to that file. So now the file is there, but the SEG
> > > > > is still not getting updates such as pending, finished, etc.
> > > > > Everything is fine with Fork, so there is some problem with SGE-SEG.
> > > >
> > > > I also set
> > > >
> > > > export SEG_SGE_DEBUG=3 and ran
> > > > /home/globus/gt4.2.1/libexec/globus-scheduler-event-generator -s sge
> > > > -t 1225815907
> > > >
> > > >
> > > > globus_l_sge_split_into_fields()
> > > > globus_l_sge_split_into_fields(): exit success
> > > > New event: job 28 now pending
> > > > freeing fields
> > > > globus_l_sge_parse_events() exits
> > > > globus_l_sge_clean_buffer() called
> > > > globus_l_sge_split_into_fields()
> > > > globus_l_sge_split_into_fields(): exit success
> > > > New event: job 28 now completed
> > > > freeing fields
> > > > globus_l_sge_split_into_fields()
> > > > globus_l_sge_split_into_fields(): exit success
> > > >
> > > >
> > > > So the scheduler event generator seems to get the status. My
> > > > suspicion is that something is missing in the file seg_sge_module.c.
> > > > > I already have the changes mentioned here:
> > > > >
> > > > > http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram4/developer/scheduler-tutorial-seg.html
> > > >
> > > > I wonder what else is missing.
> > > >
> > > >
> > > > Prakashan
> > > >
> > > >
> > > >
> > >
> >
> >
>