One more mail to give some news. :)
I did what Stuart suggested: I got the Perl job description, saved it to a
file, and then ran the submit command by hand.
Here is the Perl description:
$description = {
    directory => [ '/home/fhornoy' ],
    condoros => [ 'LINUX' ],
    condorarch => [ 'INTEL' ],
    stderr => [ '/dev/null' ],
    environment => [
        [ 'GLOBUS_LOCATION', '/opt/globus' ],
        [ 'X509_CERT_DIR', '/etc/grid-security/certificates' ],
        [ 'X509_USER_PROXY', '' ],
        [ 'X509_USER_CERT', '' ],
        [ 'X509_USER_KEY', '' ],
        [ 'HOME', '/home/fhornoy' ],
        [ 'LOGNAME', 'fhornoy' ],
        [ 'SCRATCH_DIRECTORY', '/home/fhornoy/.globus/scratch' ],
        [ 'JAVA_HOME', '/usr/java/jdk1.5.0_07/jre' ],
        [ 'GLOBUS_GRAM_JOB_HANDLE', 'https://193.48.145.106:8443/wsrf/services/ManagedExecutableJobService?1d9988b8-3c20-11dc-908d-0017f23158ca' ],
    ],
    xmlextensions => [ '1' ],
    executable => [ '/bin/hostname' ],
    factoryendpoint => [ 'Address: https://193.48.145.106:8443/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns5:ResourceID ns04:type="ns05:string" xmlns:ns04="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns05="http://www.w3.org/2001/XMLSchema" xmlns:ns5="http://www.globus.org/namespaces/2004/10/gram/job">SGE</ns5:ResourceID>
' ],
    stdin => [ '/dev/null' ],
    jobdir => [ '/home/fhornoy/.globus/1d9988b8-3c20-11dc-908d-0017f23158ca' ],
    jobtype => [ 'multiple' ],
    stdout => [ '/dev/null' ],
    count => [ '1' ],
};
Here is the command I ran:
/usr/bin/sudo -H -u fhornoy -S \
    /opt/globus/libexec/globus-gridmap-and-execute -g \
    /etc/grid-security/grid-mapfile \
    /opt/globus/libexec/globus-job-manager-script.pl \
    -m sge -f description.txt -c submit
It fails with GRAM_SCRIPT_ERROR:24. As explained in my previous mails, that
is expected: in the Perl description, "jobtype" is set to "multiple", and the
submit script rejects such a job because no Parallel Environment (PE) is set.
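For illustration, here is a rough shell transcription of the check in sge.pm
that is quoted further down in this thread. It is only a sketch of the logic,
not the actual Globus code: the function name check_pe is made up, and the
return value 24 stands for INVALID_SCRIPT_REPLY as given in Stuart's mail.

```shell
#!/bin/sh
# check_pe JOBTYPE PE
# Hypothetical helper mirroring the quoted sge.pm logic: "mpi" and
# "multiple" jobs need a parallel environment (PE); an empty PE or the
# value "NONE" is treated as an error. Other job types skip the check.
check_pe() {
    jobtype=$1
    pe=$2
    case "$jobtype" in
        mpi|multiple)
            if [ -z "$pe" ] || [ "$pe" = "NONE" ]; then
                echo "ERROR: Parallel Environment (PE) failure!"
                return 24   # GRAM code for INVALID_SCRIPT_REPLY
            fi
            echo "PE is $pe"
            ;;
    esac
    return 0
}
```

With this sketch, `check_pe multiple ""` fails with status 24, while
`check_pe single ""` succeeds, which matches what I see when I run the
submit script by hand.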
So in the Perl description, I changed "multiple" to "single", and it works
fine.
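For anyone who wants to repeat the test, the edit can be scripted. This is a
minimal sketch, assuming the description was saved as description.txt and
that the jobtype entry sits on one line as shown above; set_jobtype is a
made-up helper name, not part of Globus.

```shell
#!/bin/sh
# set_jobtype FILE TYPE
# Hypothetical helper: rewrite the jobtype value of a saved Perl job
# description in place. It expects a line of the form
#     jobtype => [ 'multiple' ],
# (optionally indented) and leaves every other line untouched.
set_jobtype() {
    file=$1
    new_type=$2
    tmp=$(mktemp) || return 1
    # Keep the text around the quoted value, replace only the value.
    sed "s/^\([[:space:]]*jobtype => \[ '\)[^']*\(' ],\)/\1${new_type}\2/" \
        "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example (not run here):
#   set_jobtype description.txt single
```

After that, the same globus-job-manager-script.pl command as above can be
rerun with "-f description.txt".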
So: is it normal that a "-c /bin/hostname" is interpreted as a "multiple"
job? If yes, how do I set up the PE? If not, I'll try to find out what's
going wrong. :)
Cheers,
Francois.
On 7/26/07, Francois Hornoy <[EMAIL PROTECTED]> wrote:
>
>
> Hi,
>
> As said in a previous mail, I've identified the piece of Perl code that
> fails. In the debugging information from the container, I found the command
> line that is executed:
>
> /usr/bin/sudo -H -u fhornoy -S
> /opt/globus/libexec/globus-gridmap-and-execute -g
> /etc/grid-security/grid-mapfile
> /opt/globus/libexec/globus-job-manager-script.pl -m sge
> -f /opt/globus/tmp/gram_job_mgr41857.tmp -c submit
>
> And perl logs contain (for the relevant part, pasted at the end of mail):
> Determining job type
> Job is of type multiple
> ERROR: Parallel Environment (PE) failure!
> MPI/multiple job was submitted, but no PE set by neither user nor
> administrator
> GRAM_SCRIPT_ERROR:24
>
> So why is a "-c /bin/hostname" interpreted as a multiple job? I mean: is
> this normal?
> And if it is, how do I "set that PE"?
>
>
> The piece of code: about line 422 of
> /opt/globus/lib/perl/Globus/GRAM/JobManager/sge.pm :
>
> #####
> # Determining job request type.
> #
> print("Determining job type");
> print(" Job is of type " . $description->jobtype());
> if($description->jobtype() eq "mpi" ||
>    $description->jobtype() eq "multiple")
> {
>     #####
>     # Check if RSL attribute parallel_environment is provided
>     #
>     if($description->parallel_environment())
>     {
>         $mpi_pe = $description->parallel_environment();
>     }
>
>     if(!$mpi_pe || $mpi_pe eq "NONE"){
>         print("ERROR: Parallel Environment (PE) failure!");
>         print(" MPI/multiple job was submitted, but no PE set");
>         print(" by neither user nor administrator");
>         return Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
>     }
>     else
>     {
>         print(" PE is $mpi_pe");
>         $sge_job_script->print("#\$ -pe $mpi_pe "
>                                . $description->count() . "\n");
>     }
>
>
> Cheers,
> Francois.
>
>
> On 7/26/07, Francois Hornoy <[EMAIL PROTECTED]> wrote:
> >
> >
> > Hi,
> >
> > On 7/25/07, Stuart Martin <[EMAIL PROTECTED]> wrote:
> > >
> > > You can use the GRAM2 error codes to look up the error.
> > > http://www-unix.globus.org/toolkit/docs/4.0/execution/prewsgram/user-
> > > index.html#s-gram-user-errorcodes
> > >
> > > 24 GLOBUS_GRAM_PROTOCOL_ERROR_INVALID_SCRIPT_REPLY the job
> > > manager
> > > detected an invalid script response
> > >
> > > There is some doc for this problem here:
> > > http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-
> > > index.html#s-wsgram-user-troubleshooting
> > > Find the heading for "The job manager detected an invalid script
> > > response"
> >
> >
> > Yes, I had already read that before posting my mail. I tried setting a
> > umask of 0000 for the local user we are mapped to. That did not solve my
> > problem (or maybe that 0000 umask approach is wrong?).
> >
> >
> > > If there is nothing obvious, then there is a section here on
> > > debugging script executions, where you save out the contents of the
> > > perl job description used by the scripts and run the perl submission
> > > command by hand:
> > > http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/
> > > developer-index.html#id2565465
> >
> >
> >
> > OK, I will try that as soon as the Globus website is working again.
> > Thanks.
> >
> > Francois.
> >
> >
> > > -Stu
> > >
> > > On Jul 25, 2007, at Jul 25, 6:27 AM, Francois Hornoy wrote:
> > >
> > > >
> > > > OK, I have some debugging information now, so let me restate the
> > > > context.
> > > >
> > > > Here is the basic command:
> > > > globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -c /bin/hostname
> > > >
> > > > That works fine. If I add the "-Ft SGE" option, I get an error.
> > > >
> > > > On the client side, the output of that command is:
> > > >
> > > > [EMAIL PROTECTED]:~$ globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -Ft SGE -c /bin/hostname
> > > > Submitting job...Done.
> > > > Job ID: uuid:9007c47c-3aa0-11dc-86fc-0017f23158ca
> > > > Termination time: 07/26/2007 11:16 GMT
> > > > Current job state: Failed
> > > > Destroying job...Done.
> > > > globusrun-ws: Job failed: Internal fault occurred while running the
> > > > submit script.
> > > > [EMAIL PROTECTED]:~$
> > > >
> > > > On the server side, I have attached the container output. It's a bit
> > > > long; I think the relevant lines are these (around line 828 in the
> > > > attached file):
> > > >
> > > > 2007-07-25 13:17:01,862 DEBUG exec.JobManagerScript [Thread-18,run:208] Executing command:
> > > > /usr/bin/sudo -H -u fhornoy -S /opt/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/globus/libexec/globus-job-manager-script.pl -m sge -f /opt/globus/tmp/gram_job_mgr53582.tmp -c submit
> > > > 2007-07-25 13:17:02,056 DEBUG exec.JobManagerScript [Thread-18,run:225] first line: GRAM_SCRIPT_ERROR:24
> > > > 2007-07-25 13:17:02,057 DEBUG exec.JobManagerScript [Thread-18,run:228] Read line: GRAM_SCRIPT_ERROR:24
> > > > 2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript [Thread-18,run:335] failure message: null
> > > > 2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript [Thread-18,setDone:345] script is done, setting done flag
> > > > 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1105] Done waiting for submit script
> > > > 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1129] script return code: 24
> > > > 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1134] script return code means error!
> > > > 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,createFaultFromErrorCode:3027] Creating fault from error code 24
> > > > 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:460] Fault Class: class org.globus.exec.generated.InternalFaultType
> > > > 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:461] Resource Key: {http://www.globus.org/namespaces/2004/10/gram/job}ResourceID=9007c47c-3aa0-11dc-86fc-0017f23158ca
> > > > 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:462] Description: Internal fault occurred while running the submit script.
> > > > 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:463] Cause: null
> > > > 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:464] State when failure occurred Unsubmitted
> > > > 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:466] Script Command: submit
> > > > 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:467] GT2 Error Code: 24
> > > > 2007-07-25 13:17:02,072 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:519] Script Command: submit
> > > > 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:346] fault element name: InternalFaultType
> > > > 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:350] fault element name: InternalFault
> > > > 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:353] fault element name: internalFault
> > > >
> > > >
> > > > So, I did not find much on Google about error 24, and the message is
> > > > not very explicit.
> > > >
> > > >
> > > > Cheers,
> > > > Francois.
> > > >
> > > >
> > > > On 7/23/07, alexander.beck-ratzka <alexander.beck-
> > > > [EMAIL PROTECTED]> wrote: On Monday 23 July 2007 14:19, Francois
> > > > Hornoy wrote:
> > > > > On 7/23/07, alexander.beck-ratzka <alexander.beck-
> > > > [EMAIL PROTECTED] > wrote:
> > > > > > On Monday 23 July 2007 11:18, Francois Hornoy wrote:
> > > > > > > Hi Alexander,
> > > > > > >
> > > > > > > > I tried a simple example based on yours. It just stages in a file,
> > > > > > > > runs "/bin/cat" on it (that's the job), and stages out the .out
> > > > > > > > and .err files.
> > > > > > > >
> > > > > > > > I keep getting this error (see end of mail).
> > > > > > > >
> > > > > > > > If I watch globusrun-ws -status -j job.id and the "Execution Host",
> > > > > > > > I can see that the StageIn step succeeds: the file 523.sh is
> > > > > > > > transferred correctly. But then it crashes.
> > > > > > > >
> > > > > > > > Of course, if I do exactly the same thing without "-Ft SGE", it
> > > > > > > > works perfectly.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Francois.
> > > > > > >
> > > > > > > 2007-07-23 11:10:53,427 INFO
> > > > > > > exec.StateMachine[RunQueueThread_5,logJobAccepted:3193] Job
> > > > > > > 9cd21054-38fc-11dc-b055-0017f23158ca accepted for local user
> > > > 'globus'
> > > > > > > 2007-07-23 11:10:58,370 ERROR
> > > > > > > service.TransferWork[WorkThread-39,run:724] Terminal transfer
> > > > error:
> > > > > > > Error deleting a file
> > > > > > > "/opt/globus/523.out" [Caused by: Server refused performing
> > > the
> > > > > >
> > > > > > request.
> > > > > >
> > > > > > > Custom message: Server refused deleting file (error code 1)
> > > > [Nested
> > > > > > > exception message: Custom message: Unexpected reply: 500-
> > > > Command fai
> > > > > >
> > > > > > led :
> > > > > > > System error in unlink: No such file or directory
> > > > > > > 500-A system call failed: No such file or directory
> > > > > > > 500 End.]]
> > > > > > > Error deleting a file
> > > > > > > "/opt/globus/523.out"
> > > > > > > . Caused by
> > > > > > > org.globus.ftp.exception.ServerException: Server refused
> > > > performing the
> > > > > > > request. Custom message: Server refused deleting file (error
> > > > code 1)
> > > > > > > [Nested exception message: Custom message: Unexpected reply:
> > > > 500-Comm
> > > > > >
> > > > > > and
> > > > > >
> > > > > > > failed : System error in unlink: No such file or directory
> > > > > > > 500-A system call failed: No such file or directory
> > > > > > > 500 End.]. Nested exception is
> > > > > > > org.globus.ftp.exception.UnexpectedReplyCodeException:
> > > > Custom message:
> > > > > > > Unexpected reply: 500-Command failed : System error in
> > > > unlink: No such
> > > > > >
> > > > > > file
> > > > > >
> > > > > > > or directory
> > > > > > > 500-A system call failed: No such file or directory
> > > > > > > 500 End.
> > > > > > > at org.globus.ftp.vanilla.FTPControlChannel.execute (
> > > > > > > FTPControlChannel.java:328)
> > > > > > > at org.globus.ftp.FTPClient.deleteFile(FTPClient.java:
> > > > 253)
> > > > > > > at
> > > > org.globus.transfer.reliable.service.DeleteClient.delete(
> > > > > > > DeleteClient.java:189)
> > > > > > > at
> > > > org.globus.transfer.reliable.service.TransferWork.run (
> > > > > > > TransferWork.java:688)
> > > > > > > at org.globus.wsrf.impl.work.WorkManagerImpl
> > > > $WorkWrapper.run (
> > > > > > > WorkManagerImpl.java:355)
> > > > > > > at java.lang.Thread.run(Thread.java:595)
> > > > > > > 2007-07-23 11:10:58,628 ERROR
> > > > > > > service.TransferWork[WorkThread-40,run:724] Terminal transfer
> > > > error:
> > > > > > > Error deleting a file
> > > > > > > "/opt/globus/523.err" [Caused by: Server refused performing
> > > the
> > > > > >
> > > > > > request.
> > > > > >
> > > > > > > Custom message: Server refused deleting file (error code 1)
> > > > [Nested
> > > > > > > exception message: Custom message: Unexpected reply: 500-
> > > > Command fai
> > > > > >
> > > > > > led :
> > > > > > > System error in unlink: No such file or directory
> > > > > > > 500-A system call failed: No such file or directory
> > > > > > > 500 End.]]
> > > > > > > Error deleting a file
> > > > > > > "/opt/globus/523.err"
> > > > > >
> > > > > > > It seems that you don't have a $GLOBUS_USER_HOME set, which would
> > > > > > > explain some of the problems. In the RSL file you have:
> > > > > > >
> > > > > > > ${GLOBUS_USER_HOME}/523.err 73 as stderr, and I don't believe that
> > > > > > > $GLOBUS_USER_HOME is /opt/globus for a simple grid user... Do you
> > > > > > > see any of the poststage datasets, if you poststage them?
> > > > >
> > > > > > Actually, I'm mapped to the remote globus user, so that points to
> > > > > > /opt/globus/. And as I said, 523.sh appears in /opt/globus, so some
> > > > > > staging seems to work...
> > > > >
> > > > > What do you mean by poststage datasets?
> > > > >
> > > >
> > > > At the end of my RSL file I have a section like this:
> > > >
> > > > #############################cut here######################
> > > > <fileStageOut>
> > > >     <!-- stage out stdout -->
> > > >     <transfer>
> > > >         <sourceUrl>file:///${GLOBUS_USER_HOME}/523.out</sourceUrl>
> > > >         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.out</destinationUrl>
> > > >     </transfer>
> > > >     <!-- stage out stderr -->
> > > >     <transfer>
> > > >         <sourceUrl>file:///${GLOBUS_USER_HOME}/523.err</sourceUrl>
> > > >         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.err</destinationUrl>
> > > >     </transfer>
> > > >     <!-- stage out log -->
> > > >     <transfer>
> > > >         <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.log</sourceUrl>
> > > >         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.log</destinationUrl>
> > > >     </transfer>
> > > >     <!-- stage out task results -->
> > > >     <transfer>
> > > >         <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.tar</sourceUrl>
> > > >         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.tar</destinationUrl>
> > > >     </transfer>
> > > > </fileStageOut>
> > > > #############################cut here######################
> > > >
> > > > This is where the post-staging (fileStageOut) is performed. The other
> > > > section (cleanup) describes which datasets should be deleted.
> > > >
> > > > Okay, if you submit through SGE, you get a different environment. Your
> > > > output files will probably be located on the worker node. I believe that
> > > > SGE jobs can't access /opt/globus, so the output files won't be written!
> > > > Contact your system administrator and ask which filesystem directories
> > > > can be accessed by SGE jobs...
> > > >
> > > > > Is there any logfile, or something else I can do, to get some verbose
> > > > > debugging information? Where is the "submit script" that fails?
> > > > >
> > > > >
> > > >
> > > > Use the "-debug" option in the globusrun-ws call.
> > > >
> > > > Cheers
> > > >
> > > > Alexander
> > > >
> > > > <output.log>
> > >
> > >
> >
>