Well, I've been looking through the file
/opt/globus/lib/perl/Globus/GRAM/JobManager/sge.pm to find where
GRAM_SCRIPT_ERROR:24 is raised.
I found it in this piece of code (sge.pm attached), at line 422:
#####
# Determining job request type.
#
print("Determining job type");
print(" Job is of type " . $description->jobtype());
if($description->jobtype() eq "mpi" ||
   $description->jobtype() eq "multiple")
{
    #####
    # Check if RSL attribute parallel_environment is provided
    #
    if($description->parallel_environment())
    {
        $mpi_pe = $description->parallel_environment();
    }
    if(!$mpi_pe || $mpi_pe eq "NONE"){
        print("ERROR: Parallel Environment (PE) failure!");
        print(" MPI/multiple job was submitted, but no PE set");
        print(" by neither user nor administrator");
        return Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
    }
    else
    {
        print(" PE is $mpi_pe");
        $sge_job_script->print("#\$ -pe $mpi_pe "
                               . $description->count() . "\n");
    }
So, here is the output related to that piece of code:
Determining job type
Job is of type multiple
ERROR: Parallel Environment (PE) failure!
MPI/multiple job was submitted, but no PE set by neither user nor
administrator
GRAM_SCRIPT_ERROR:24
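To make the control flow of that excerpt explicit, here is a minimal Python sketch of the same decision (the real module is Perl; the function and parameter names are hypothetical). `rsl_pe` stands in for `$description->parallel_environment()`, and `admin_pe` for whatever site default `$mpi_pe` already holds before this block, which the excerpt does not show. What sge.pm does for other job types is not shown either, so the sketch simply returns an empty string there.

```python
from typing import Optional, Union

# Code returned by the excerpt as Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
# it surfaced as GRAM_SCRIPT_ERROR:24 in the output above.
INVALID_SCRIPT_REPLY = 24


def sge_pe_line(job_type: str, count: int,
                rsl_pe: Optional[str] = None,
                admin_pe: Optional[str] = None) -> Union[str, int]:
    """Return the SGE '#$ -pe' directive, or an error code (hypothetical helper)."""
    if job_type not in ("mpi", "multiple"):
        # Assumption: the excerpt does not show this branch.
        return ""
    # A non-empty RSL parallel_environment overrides the site default.
    pe = rsl_pe if rsl_pe else admin_pe
    if not pe or pe == "NONE":
        # "MPI/multiple job was submitted, but no PE set"
        return INVALID_SCRIPT_REPLY
    return f"#$ -pe {pe} {count}\n"


if __name__ == "__main__":
    print(sge_pe_line("multiple", 1))                  # neither user nor admin set a PE
    print(sge_pe_line("multiple", 4, rsl_pe="mpich"))  # "mpich" is a made-up PE name
```

With neither value set, a "multiple" job gets 24, matching the failure above; passing a PE name yields a directive such as `#$ -pe mpich 4`.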
So, two questions come to mind:
* It says that I submitted a multiple/MPI job. But "-c /bin/hostname" isn't
one, is it?
* What must I do to set that "Parallel Environment"?
Cheers,
Francois.
On 7/25/07, Francois Hornoy <[EMAIL PROTECTED]> wrote:
OK, I have some debugging information now, so I'll repeat the context.
Here is the basic command:
globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -c /bin/hostname
It works fine. If I add the "-Ft SGE" option, I get an error.
On the client side, the output of that command is:
[EMAIL PROTECTED]:~$ globusrun-ws -submit -F https://MyIP:8443/wsrf/services/ManagedJobFactoryService -Ft SGE -c /bin/hostname
Submitting job...Done.
Job ID: uuid:9007c47c-3aa0-11dc-86fc-0017f23158ca
Termination time: 07/26/2007 11:16 GMT
Current job state: Failed
Destroying job...Done.
globusrun-ws: Job failed: Internal fault occurred while running the submit
script.
[EMAIL PROTECTED]:~$
For the server side, I attached the container output. It's a bit long; I
think these are the relevant lines (around line 828 of the attached file):
2007-07-25 13:17:01,862 DEBUG exec.JobManagerScript [Thread-18,run:208]
Executing command:
/usr/bin/sudo -H -u fhornoy -S /opt/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/globus/libexec/globus-job-manager-script.pl -m sge -f /opt/globus/tmp/gram_job_mgr53582.tmp -c submit
2007-07-25 13:17:02,056 DEBUG exec.JobManagerScript [Thread-18,run:225]
first line: GRAM_SCRIPT_ERROR:24
2007-07-25 13:17:02,057 DEBUG exec.JobManagerScript [Thread-18,run:228]
Read line: GRAM_SCRIPT_ERROR:24
2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript [Thread-18,run:335]
failure message: null
2007-07-25 13:17:02,059 DEBUG exec.JobManagerScript[Thread-18,setDone:345]
script is done, setting done flag
2007-07-25 13:17:02,060 DEBUG
exec.StateMachine[RunQueueThread_0,processSubmitState:1105] Done waiting for
submit script
2007-07-25 13:17:02,060 DEBUG
exec.StateMachine[RunQueueThread_0,processSubmitState:1129] script return code:
24
2007-07-25 13:17:02,060 DEBUG
exec.StateMachine[RunQueueThread_0,processSubmitState:1134] script return code
means error!
2007-07-25 13:17:02,060 DEBUG
exec.StateMachine[RunQueueThread_0,createFaultFromErrorCode:3027] Creating
fault from error
code 24
2007-07-25 13:17:02,066 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:460]
Fault Class: class
org.globus.exec.generated.InternalFaultType
2007-07-25 13:17:02,066 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:461]
Resource Key:
{http://www.globus.org/namespaces/2004/10/gram/job}ResourceID=9007c47c-3aa0-11dc-86fc-0017f23158ca
2007-07-25 13:17:02,066 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:462]
Description: Internal fault occurred while
running the submit script.
2007-07-25 13:17:02,067 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:463]
Cause: null
2007-07-25 13:17:02,067 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:464]
State when failure occurred Unsubmitted
2007-07-25 13:17:02,067 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:466]
Script Command: submit
2007-07-25 13:17:02,067 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:467]
GT2 Error Code: 24
2007-07-25 13:17:02,072 DEBUG utils.FaultUtils[RunQueueThread_0,makeFault:519]
Script Command: submit
2007-07-25 13:17:02,072 DEBUG
ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca[RunQueueThread_0,setFault:346]
fault element name: InternalFaultType
2007-07-25 13:17:02,072 DEBUG
ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca[RunQueueThread_0,setFault:350]
fault element name: InternalFault
2007-07-25 13:17:02,072 DEBUG
ManagedJobResourceImpl.9007c47c-3aa0-11dc-86fc-0017f23158ca[RunQueueThread_0,setFault:353]
fault element name: internalFault
I did not find much on Google about error 24, and the message is not very
explicit.
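Based only on what this log shows, exec.JobManagerScript reads the first line the submit script prints, and exec.StateMachine treats the number after `GRAM_SCRIPT_ERROR:` as the script return code. A minimal Python sketch of that parsing step follows; the regular expression and function names are assumptions, not Globus code, and the name table holds only the one code identified so far (the sge.pm excerpt returns INVALID_SCRIPT_REPLY, which surfaced as 24).

```python
import re
from typing import Optional

# First line printed by the job manager script on failure,
# e.g. "GRAM_SCRIPT_ERROR:24" as read by exec.JobManagerScript.
SCRIPT_ERROR_RE = re.compile(r"^GRAM_SCRIPT_ERROR:(\d+)$")

# Partial table: only the code traced back to sge.pm.
KNOWN_CODES = {24: "INVALID_SCRIPT_REPLY"}


def script_error_code(first_line: str) -> Optional[int]:
    """Return the numeric error code, or None if the line is not an error reply."""
    match = SCRIPT_ERROR_RE.match(first_line.strip())
    return int(match.group(1)) if match else None


def describe(first_line: str) -> str:
    """Human-readable summary of a script reply line."""
    code = script_error_code(first_line)
    if code is None:
        return "not a GRAM_SCRIPT_ERROR line"
    return f"error {code} ({KNOWN_CODES.get(code, 'unknown')})"


if __name__ == "__main__":
    print(describe("GRAM_SCRIPT_ERROR:24"))  # error 24 (INVALID_SCRIPT_REPLY)
```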
Cheers,
Francois.
On 7/23/07, alexander.beck-ratzka <[EMAIL PROTECTED]> wrote:
>
> On Monday 23 July 2007 14:19, Francois Hornoy wrote:
> > On 7/23/07, alexander.beck-ratzka <[EMAIL PROTECTED]> wrote:
> > > On Monday 23 July 2007 11:18, Francois Hornoy wrote:
> > > > Hi Alexander,
> > > >
> > > > I tried a simple example based on yours. It just stages in a file,
> > > > runs "/bin/cat" on it (that's the job), and stages out the .out and
> > > > .err files.
> > > >
> > > > I keep getting this error (see end of mail).
> > > >
> > > > If I watch globusrun-ws -status -j job.id and the "Execution Host",
> > > > I can see that the StageIn step succeeds: the file 523.sh is
> > > > transferred correctly. But then it crashes.
> > > >
> > > > Of course, if I do exactly the same thing without "-Ft SGE", it works
> > > > perfectly.
> > > >
> > > > Cheers,
> > > > Francois.
> > > >
> > > > 2007-07-23 11:10:53,427 INFO
> > > > exec.StateMachine[RunQueueThread_5,logJobAccepted:3193] Job
> > > > 9cd21054-38fc-11dc-b055-0017f23158ca accepted for local user 'globus'
> > > > 2007-07-23 11:10:58,370 ERROR
> > > > service.TransferWork[WorkThread-39,run:724] Terminal transfer error:
> > > > Error deleting a file
> > > > "/opt/globus/523.out" [Caused by: Server refused performing the request.
> > > > Custom message: Server refused deleting file (error code 1) [Nested
> > > > exception message: Custom message: Unexpected reply: 500-Command failed :
> > > > System error in unlink: No such file or directory
> > > > 500-A system call failed: No such file or directory
> > > > 500 End.]]
> > > > Error deleting a file
> > > > "/opt/globus/523.out"
> > > > . Caused by
> > > > org.globus.ftp.exception.ServerException: Server refused performing the
> > > > request. Custom message: Server refused deleting file (error code 1)
> > > > [Nested exception message: Custom message: Unexpected reply: 500-Command
> > > > failed : System error in unlink: No such file or directory
> > > > 500-A system call failed: No such file or directory
> > > > 500 End.]. Nested exception is
> > > > org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message:
> > > > Unexpected reply: 500-Command failed : System error in unlink: No such
> > > > file or directory
> > > > 500-A system call failed: No such file or directory
> > > > 500 End.
> > > > at org.globus.ftp.vanilla.FTPControlChannel.execute(FTPControlChannel.java:328)
> > > > at org.globus.ftp.FTPClient.deleteFile(FTPClient.java:253)
> > > > at org.globus.transfer.reliable.service.DeleteClient.delete(DeleteClient.java:189)
> > > > at org.globus.transfer.reliable.service.TransferWork.run(TransferWork.java:688)
> > > > at org.globus.wsrf.impl.work.WorkManagerImpl$WorkWrapper.run(WorkManagerImpl.java:355)
> > > > at java.lang.Thread.run(Thread.java:595)
> > > > 2007-07-23 11:10:58,628 ERROR
> > > > service.TransferWork[WorkThread-40,run:724] Terminal transfer error:
> > > > Error deleting a file
> > > > "/opt/globus/523.err" [Caused by: Server refused performing the request.
> > > > Custom message: Server refused deleting file (error code 1) [Nested
> > > > exception message: Custom message: Unexpected reply: 500-Command failed :
> > > > System error in unlink: No such file or directory
> > > > 500-A system call failed: No such file or directory
> > > > 500 End.]]
> > > > Error deleting a file
> > > > "/opt/globus/523.err"
> > >
> > > It seems that you don't have a $GLOBUS_USER_HOME; this would explain
> > > some problems. In the RSL file you have:
> > >
> > > ${GLOBUS_USER_HOME}/523.err 73 as stderr, and I don't believe that
> > > $GLOBUS_USER_HOME is /opt/globus for a simple grid user... Do you see
> > > any of the poststage datasets, if you poststage them?
> >
> > Actually, I'm mapped to the remote globus user, so that points to
> > /opt/globus/. And as I said, 523.sh appears in /opt/globus, so some
> > staging seems to work...
> >
> > What do you mean by poststage datasets?
> >
>
> At the end of my RSL dataset I have a passage like this:
>
> #############################cut here######################
> <fileStageOut>
>   <!-- stage out stdout -->
>   <transfer>
>     <sourceUrl>file:///${GLOBUS_USER_HOME}/523.out</sourceUrl>
>     <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.out</destinationUrl>
>   </transfer>
>   <!-- stage out stderr -->
>   <transfer>
>     <sourceUrl>file:///${GLOBUS_USER_HOME}/523.err</sourceUrl>
>     <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.err</destinationUrl>
>   </transfer>
>   <!-- stage out log -->
>   <transfer>
>     <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.log</sourceUrl>
>     <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.log</destinationUrl>
>   </transfer>
>   <!-- stage out task results -->
>   <transfer>
>     <sourceUrl>file:///${GLOBUS_USER_HOME}/GEO600/tasks/523.tar</sourceUrl>
>     <destinationUrl>gsiftp://globus.submit.host/store/GEO600/tasks/523.tar</destinationUrl>
>   </transfer>
> </fileStageOut>
> #############################cut here######################
>
> This is where the poststaging (fileStageOut) is performed. The other
> section (cleanup) describes which datasets should be deleted.
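Since each `<transfer>` in that section is just a sourceUrl/destinationUrl pair, the block can be generated rather than written by hand. A minimal Python sketch using the standard `xml.etree.ElementTree` module; the function name is hypothetical, and it only reproduces the element layout quoted above, not the full job description schema or the cleanup section.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def file_stage_out(transfers: Iterable[Tuple[str, str, str]]) -> str:
    """Build a <fileStageOut> element like the quoted RSL fragment.

    Each item is (comment, sourceUrl, destinationUrl).
    """
    root = ET.Element("fileStageOut")
    for comment, src, dst in transfers:
        # Same layout as above: a comment, then one <transfer> per file.
        root.append(ET.Comment(f" {comment} "))
        transfer = ET.SubElement(root, "transfer")
        ET.SubElement(transfer, "sourceUrl").text = src
        ET.SubElement(transfer, "destinationUrl").text = dst
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    home = "${GLOBUS_USER_HOME}"  # left unexpanded, as in the RSL
    dest = "gsiftp://globus.submit.host/store/GEO600/tasks"
    print(file_stage_out([
        ("stage out stdout", f"file:///{home}/523.out", f"{dest}/523.out"),
        ("stage out stderr", f"file:///{home}/523.err", f"{dest}/523.err"),
    ]))
```

The output is a single unindented line; it is the same element structure, just without the whitespace of the hand-written version.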
>
> Okay: when you go through SGE, you get a different environment. Your
> output files will probably be located on a worker node. I believe SGE
> can't reach /opt/globus, so the output files won't be written! Contact
> your system administrator and ask which filesystem directories can be
> accessed by SGE jobs...
>
> > Is there any log file, or something else I can do, to get verbose
> > debugging information? Where is the "submit script" that fails?
> >
> >
>
> Use the "-debug" option on the globusrun-ws call.
>
> Cheers
>
> Alexander
>