As the SGE module isn't ours, I don't know why it would be setting the jobtype to multiple here. If I were you, I would just go into the sge.pm file and change it so that it doesn't set the jobtype to multiple unless asked to. :-)

Charles

On Jul 27, 2007, at 11:09 AM, Francois Hornoy wrote:


 One more mail to give some news. :)

I did what Stuart said: I got the Perl description, put it in a file, and then launched the command manually.

 Here is the Perl description:

$description = {
    directory => [ '/home/fhornoy' ],
    condoros => [ 'LINUX' ],
    condorarch => [ 'INTEL' ],
    stderr => [ '/dev/null' ],
    environment => [ [ 'GLOBUS_LOCATION', '/opt/globus' ], [ 'X509_CERT_DIR', '/etc/grid-security/certificates' ], [ 'X509_USER_PROXY', '' ], [ 'X509_USER_CERT', '' ], [ 'X509_USER_KEY', '' ], [ 'HOME', '/home/fhornoy' ], [ 'LOGNAME', 'fhornoy' ], [ 'SCRATCH_DIRECTORY', '/home/fhornoy/.globus/scratch' ], [ 'JAVA_HOME', '/usr/java/jdk1.5.0_07/jre' ], [ 'GLOBUS_GRAM_JOB_HANDLE', 'https://193.48.145.106:8443/wsrf/services/ManagedExecutableJobService?1d9988b8-3c20-11dc-908d-0017f23158ca' ], ],
    xmlextensions => [ '1' ],
    executable => [ '/bin/hostname' ],
    factoryendpoint => [ 'Address: https://193.48.145.106:8443/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns5:ResourceID ns04:type="ns05:string" xmlns:ns04="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns05="http://www.w3.org/2001/XMLSchema" xmlns:ns5="http://www.globus.org/namespaces/2004/10/gram/job">SGE</ns5:ResourceID>
' ],
    stdin => [ '/dev/null' ],
    jobdir => [ '/home/fhornoy/.globus/1d9988b8-3c20-11dc-908d-0017f23158ca' ],
    jobtype => [ 'multiple' ],
    stdout => [ '/dev/null' ],
    count => [ '1' ],
};


 Here is the command i launch:
/usr/bin/sudo -H -u fhornoy -S /opt/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/globus/libexec/globus-job-manager-script.pl -m sge -f description.txt -c submit

And it fails with GRAM_SCRIPT_ERROR:24. But (remember my previous mails) that is expected, because in the Perl description "jobtype" is set to "multiple", and the script fails because my "PE environment is not set".

So in the Perl description, I changed "multiple" to "single", and it works fine.
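
A minimal sketch of that edit, assuming the description was saved as description.txt as above. The heredoc is only a hypothetical stand-in holding the relevant keys so that the sketch runs on its own, and description-single.txt is a file name made up here:

```shell
# Hypothetical stand-in for the saved job description (relevant keys only).
cat > description.txt <<'EOF'
$description = {
    executable => [ '/bin/hostname' ],
    jobtype => [ 'multiple' ],
    count => [ '1' ],
};
EOF

# Rewrite only the jobtype entry; write to a new file so the original is kept.
sed "s/jobtype => \[ 'multiple' ]/jobtype => [ 'single' ]/" description.txt > description-single.txt

# Show the resulting jobtype line.
grep jobtype description-single.txt
```

The edited file can then be passed with -f in place of description.txt in the command above.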

So: is it normal that a "-c /bin/hostname" is interpreted as a "multiple" job? If yes, how do I set up my PE environment, please? If no, I'll try to find out what's going wrong. :)

 Cheers,
 Francois.



On 7/26/07, Francois Hornoy <[EMAIL PROTECTED]> wrote:

  Hi,

As I said in a previous mail, I've identified the piece of Perl code that fails. In the debugging information from the container, I found the line that is executed:

/usr/bin/sudo -H -u fhornoy -S /opt/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/globus/libexec/globus-job-manager-script.pl -m sge -f /opt/globus/tmp/gram_job_mgr41857.tmp -c submit

And the Perl logs contain the following (the relevant code is pasted at the end of this mail):
Determining job type
 Job is of type multiple
ERROR: Parallel Environment (PE) failure!
MPI/multiple job was submitted, but no PE set by neither user nor administrator
GRAM_SCRIPT_ERROR:24

So why is a "-c /bin/hostname" interpreted as a multiple job? I mean: is this normal?
And if it is, how do I "set that PE"?


The piece of code, at about line 422 of /opt/globus/lib/perl/Globus/GRAM/JobManager/sge.pm:

    #####
    # Determining job request type.
    #
    print("Determining job type");
    print("  Job is of type " . $description->jobtype());
    if($description->jobtype() eq "mpi" ||
       $description->jobtype() eq "multiple")
    {
        #####
        # Check if RSL attribute parallel_environment is provided
        #
        if($description->parallel_environment())
        {
            $mpi_pe = $description->parallel_environment();
        }

        if(!$mpi_pe || $mpi_pe eq "NONE")
        {
            print("ERROR: Parallel Environment (PE) failure!");
            print("  MPI/multiple job was submitted, but no PE set");
            print("  by neither user nor administrator");
            return Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
        }
        else
        {
            print("  PE is $mpi_pe");
            $sge_job_script->print("#\$ -pe $mpi_pe "
                                   . $description->count() . "\n");
        }


 Cheers,
 Francois.



On 7/26/07, Francois Hornoy <[EMAIL PROTECTED]> wrote:

 Hi,

On 7/25/07, Stuart Martin <[EMAIL PROTECTED]> wrote:
You can use the GRAM2 error codes to look up the error.
http://www-unix.globus.org/toolkit/docs/4.0/execution/prewsgram/user-index.html#s-gram-user-errorcodes

24 GLOBUS_GRAM_PROTOCOL_ERROR_INVALID_SCRIPT_REPLY the job manager
detected an invalid script response

There is some doc for this problem here:
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-troubleshooting
Find the heading for "The job manager detected an invalid script
response"

Yeah, I already read that before posting my mail. I tried setting a umask of 0000 for the local user we are mapped to. That did not solve my problem (or maybe that 0000 umask thing is wrong?).


If there is nothing obvious, then there is a section here on
debugging script executions, where you save out the contents of the
perl job description used by the scripts and run the perl submission
command by hand:
        http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/developer-index.html#id2565465


OK, I will try that as soon as the Globus website is working again. Thanks.

 Francois.


-Stu

On Jul 25, 2007, at 6:27 AM, Francois Hornoy wrote:

>
> OK. I have some debugging information now, so let me repeat the context.
>
>  Here is the basic command:
>  globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -c /bin/hostname
>
>  That works fine. If I add the "-Ft SGE" option, I get an error.
>
>  On the client side, the output of that command is:
>
> [EMAIL PROTECTED]:~$ globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -Ft SGE -c /bin/hostname
> Submitting job...Done.
> Job ID: uuid:9007c47c-3aa0-11dc-86fc-0017f23158ca
> Termination time: 07/26/2007 11:16 GMT
> Current job state: Failed
> Destroying job...Done.
> globusrun-ws: Job failed: Internal fault occurred while running the submit script.
> [EMAIL PROTECTED]:~$
>
>  On the server side, I attached the container output. It's a bit
> long; I think the relevant lines are these (around line 828 in the
> attached file):
>
> 2007-07-25 13:17:01,862 DEBUG exec.JobManagerScript [Thread-18,run:208] Executing command:
> /usr/bin/sudo -H -u fhornoy -S /opt/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/globus/libexec/globus-job-manager-script.pl -m sge -f /opt/globus/tmp/gram_job_mgr53582.tmp -c submit
> 2007-07-25 13:17:02,056 DEBUG exec.JobManagerScript [Thread-18,run:225] first line: GRAM_SCRIPT_ERROR:24
> 2007-07-25 13:17:02,057 DEBUG exec.JobManagerScript [Thread-18,run:228] Read line: GRAM_SCRIPT_ERROR:24
> 2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript [Thread-18,run:335] failure message: null
> 2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript [Thread-18,setDone:345] script is done, setting done flag
> 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1105] Done waiting for submit script
> 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1129] script return code: 24
> 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,processSubmitState:1134] script return code means error!
> 2007-07-25 13:17:02,060 DEBUG exec.StateMachine [RunQueueThread_0,createFaultFromErrorCode:3027] Creating fault from error code 24
> 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:460] Fault Class: class org.globus.exec.generated.InternalFaultType
> 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:461] Resource Key: {http://www.globus.org/namespaces/2004/10/gram/job}ResourceID=9007c47c-3aa0-11dc-86fc-0017f23158ca
> 2007-07-25 13:17:02,066 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:462] Description: Internal fault occurred while running the submit script.
> 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:463] Cause: null
> 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:464] State when failure occurred Unsubmitted
> 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:466] Script Command: submit
> 2007-07-25 13:17:02,067 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:467] GT2 Error Code: 24
> 2007-07-25 13:17:02,072 DEBUG utils.FaultUtils [RunQueueThread_0,makeFault:519] Script Command: submit
> 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:346] fault element name: InternalFaultType
> 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:350] fault element name: InternalFault
> 2007-07-25 13:17:02,072 DEBUG ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca [RunQueueThread_0,setFault:353] fault element name: internalFault
>
>
>  So, I did not find much on Google about error 24, and the error
> itself is not very explicit.
>
>
>  Cheers,
>  Francois.
>
>
> On 7/23/07, alexander.beck-ratzka <alexander.beck-[EMAIL PROTECTED]> wrote:
> On Monday 23 July 2007 14:19, Francois Hornoy wrote:
> > On 7/23/07, alexander.beck-ratzka <alexander.beck-[EMAIL PROTECTED]> wrote:
> > > On Monday 23 July 2007 11:18, Francois Hornoy wrote:
> > > >  Hi Alexander,
> > > >
> > > > >  I tried a simple example based on yours. It just stages in a file,
> > > > > runs "/bin/cat" on it (that's the job), and stages out the .out
> > > > > and .err files.
> > > > >
> > > > >  I keep getting this error (see end of mail).
> > > > >
> > > > >  If I watch globusrun-ws -status -j job.id and the
> > > > > "Execution Host", I can see that the StageIn step succeeds: the
> > > > > file 523.sh is transferred correctly. But then it crashes.
> > > > >
> > > > >  Of course, if I do exactly the same thing without "-Ft SGE",
> > > > > it works perfectly.
> > > >
> > > >   Cheers,
> > > >   Francois.
> > > >
> > > > > 2007-07-23 11:10:53,427 INFO exec.StateMachine[RunQueueThread_5,logJobAccepted:3193] Job 9cd21054-38fc-11dc-b055-0017f23158ca accepted for local user 'globus'
> > > > > 2007-07-23 11:10:58,370 ERROR service.TransferWork[WorkThread-39,run:724] Terminal transfer error: Error deleting a file "/opt/globus/523.out" [Caused by: Server refused performing the request. Custom message: Server refused deleting file (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : System error in unlink: No such file or directory
> > > > > 500-A system call failed: No such file or directory
> > > > > 500 End.]]
> > > > > Error deleting a file
> > > > >  "/opt/globus/523.out"
> > > > > . Caused by
> > > > > org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused deleting file (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : System error in unlink: No such file or directory
> > > > > 500-A system call failed: No such file or directory
> > > > > 500 End.].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 500-Command failed : System error in unlink: No such file or directory
> > > > > 500-A system call failed: No such file or directory
> > > > > 500 End.
> > > > >         at org.globus.ftp.vanilla.FTPControlChannel.execute(FTPControlChannel.java:328)
> > > > >         at org.globus.ftp.FTPClient.deleteFile(FTPClient.java:253)
> > > > >         at org.globus.transfer.reliable.service.DeleteClient.delete(DeleteClient.java:189)
> > > > >         at org.globus.transfer.reliable.service.TransferWork.run(TransferWork.java:688)
> > > > >         at org.globus.wsrf.impl.work.WorkManagerImpl$WorkWrapper.run(WorkManagerImpl.java:355)
> > > > >         at java.lang.Thread.run(Thread.java:595)
> > > > > 2007-07-23 11:10:58,628 ERROR service.TransferWork[WorkThread-40,run:724] Terminal transfer error: Error deleting a file "/opt/globus/523.err" [Caused by: Server refused performing the request. Custom message: Server refused deleting file (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : System error in unlink: No such file or directory
> > > > > 500-A system call failed: No such file or directory
> > > > > 500 End.]]
> > > > > Error deleting a file
> > > > >  "/opt/globus/523.err"
> > >
> > > It seems that you don't have a $GLOBUS_USER_HOME, which would explain
> > > some problems. In the rsl file you have:
> > >
> > > ${GLOBUS_USER_HOME}/523.err 73 as stderr, and I don't believe that
> > > $GLOBUS_USER_HOME is /opt/globus for a simple grid user... Do you see
> > > any of the poststage datasets, if you poststage them?
> >
> >  Actually, I'm mapped to the remote globus user, so that points to
> > /opt/globus/. And as I said, 523.sh appears in /opt/globus, so some
> > staging seems to work...
> >
> >  What do you mean by poststage datasets?
> >
>
> At the end of my rsl dataset I have a passage like this:
>
> #############################cut here######################
>         <fileStageOut>
>                 <!-- stage out stdout -->
>                 <transfer>
>                         <sourceUrl>file:///${GLOBUS_USER_HOME}/523.out</sourceUrl>
>                         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.out</destinationUrl>
>                 </transfer>
>                 <!-- stage out stderr -->
>                 <transfer>
>                         <sourceUrl>file:///${GLOBUS_USER_HOME}/523.err</sourceUrl>
>                         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.err</destinationUrl>
>                 </transfer>
>                 <!-- stage out log -->
>                 <transfer>
>                         <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.log</sourceUrl>
>                         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.log</destinationUrl>
>                 </transfer>
>                 <!-- stage out task results -->
>                 <transfer>
>                         <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.tar</sourceUrl>
>                         <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.tar</destinationUrl>
>                 </transfer>
>         </fileStageOut>
> #############################cut here######################
>
> This is where the poststaging (fileStageOut) is performed. The other
> section (cleanup) describes which datasets should be deleted.
>
> Okay, when you enter SGE, you get another environment. Your output
> files will probably be located on the worker node. I believe that SGE
> can't access /opt/globus, so the output files won't be written!
> Contact your system administrator and ask which filesystem
> directories can be accessed by SGE jobs...
>
> >  Is there any logfile, or something else I can do, to get some
> > verbose debugging information? Where is the "submit script" that fails?
> >
> >
>
> Option "-debug" at the globusrun-ws call
>
> Cheers
>
> Alexander
>
> <output.log>




