Victor:

It's possible it's still a quoting issue. You could try quoting like:

"-n 12"

Also, you could try putting a -l or -s as appropriate in front of
/sw/altd/bin/aprun:

http://www.globus.org/toolkit/docs/5.0/5.0.2/execution/gram5/pi/#gram5-cmd-globus-job-run

If that doesn't work, you might try putting the arguments in as an RSL
clause with -X.

Otherwise, I'm not sure. I'm cc:ing [email protected] so that the GRAM
folks will see it too.

Eric

----- Original Message -----
> That doesn't work. There is no such program as
> “/sw/altd/bin/aprun -n 12”. :)
>
> vic...@krakenpf8(XT5):~/globustests> globus-job-run
> grid.nics.utk.edu:2119/jobmanager-pbs -count 12 -m 1 -p UT-SUPPORT -d
> /lustre/scratch/victor -stdout /lustre/scratch/victor/globusjobrun.out
> -stderr /lustre/scratch/victor/globusjobrun.err
> "/sw/altd/bin/aprun -n 12" /lustre/scratch/victor/test1
>
> GRAM Job failed because the executable does not exist (error code 5)
>
> Also, why does doing the job via globusrun generate the small concise
> job and when done with globus-job-run it generates a large two part
> job that wants to ssh to a list of nodes? Where is the code that
> generates the globus-job-run job script(s) as I will need to fix that
> for sure. I see where the argument parsing is done, but what I don’t
> get is why does the “-n” get removed when used with globus-job-run but
> parses/works ok when I do globusrun? Strange.
>
> -Victor
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Tuesday, December 14, 2010 2:33 PM
> To: Hazlewood, Victor Gene
> Cc: JP Navarro; gig-pack
> Subject: Re: globusrun and globus-job-run difference
>
> Victor,
>
> My guess is that globus-job-run may be misinterpreting the "-n" as an
> option that it thinks it needs to interpret, as opposed to an argument
> to aprun. If this is the case, it is possible that some creative
> quoting may work.
> You could try something like:
>
> globus-job-run grid.nics.utk.edu:2119/jobmanager-pbs -count 12 -m 1 -p
> UT-SUPPORT -d /lustre/scratch/victor -stdout
> /lustre/scratch/victor/globusjobrun.out -stderr
> /lustre/scratch/victor/globusjobrun.err "/sw/altd/bin/aprun -n 12"
> /lustre/scratch/victor/test1
>
> Eric
>
> ----- Original Message -----
> > > Eric,
> > >
> > > I am testing the different ways to invoke GRAM5 remote execution to
> > > make sure things are consistent and have come across the following
> > > problem. What do I need to do to modify the GRAM5 software so that
> > > globusrun and globus-job-run commands for the same end user
> > > programming result generate the same or, at least, nearly the same
> > > end user job? I am not sure why or where the globus-job-run command
> > > generates such a strange batch job with embedded “ssh” commands and
> > > then strips the arguments differently than when using globusrun. Any
> > > pointers would be helpful.
> > >
> > > When using globusrun I get the following:
> > >
> > > Kraken$ globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs
> > > '&(executable=/sw/altd/bin/aprun)(arguments="-n 12"
> > > "/lustre/scratch/victor/test1")(jobType=single)(count="12")(maxtime=1)(directory='/lustre/scratch/victor')(save_job_description="yes")(emailonabort=yes)(emailonexecution=yes)(emailontermination=yes)(email_address="[email protected]")(project=UT-SUPPORT)(stdout='/lustre/scratch/victor/globusjob.out')(stderr='/lustre/scratch/victor/globusjob.err')(queue=batch)'
> > >
> > > This generates and submits the following successful batch job
> > > (though I don’t like that it automatically specifies ‘< /dev/null’
> > > as stdin if an stdin RSL is not specified):
> > >
> > > #! /bin/sh
> > > # PBS batch job script built by Globus job manager
> > > #
> > > #PBS -S /bin/sh
> > > #PBS -M [email protected]
> > > #PBS -m abe
> > > #PBS -A UT-SUPPORT
> > > #PBS -l walltime=1:00
> > > #PBS -o /lustre/scratch/victor/globusjob.out
> > > #PBS -e /lustre/scratch/victor/globusjob.err
> > > #PBS -l size=24
> > > X509_USER_PROXY="/nics/a/home/victor/.globus/job/grid.nics.utk.edu/16073652731378661006.1367557268601599423/x509_user_proxy";
> > > export X509_USER_PROXY;
> > > GLOBUS_LOCATION="/nics/e/sw/teragrid/gram5-5.0.2-r1";
> > > export GLOBUS_LOCATION;
> > > GLOBUS_GRAM_JOB_CONTACT="https://grid.nics.utk.edu:50383/16073652731378661006/1367557268601599423/";
> > > export GLOBUS_GRAM_JOB_CONTACT;
> > > HOME="/nics/a/home/victor";
> > > export HOME;
> > > LOGNAME="victor";
> > > export LOGNAME;
> > > GLOBUS_GASS_CACHE_DEFAULT="/nics/a/home/victor/.globus/.gass_cache";
> > > export GLOBUS_GASS_CACHE_DEFAULT;
> > >
> > > #Change to directory requested by user
> > > cd /lustre/scratch/victor
> > > /sw/altd/bin/aprun -n 12 /lustre/scratch/victor/test1 < /dev/null
> > >
> > > However, when I do essentially the exact same job using
> > > globus-job-run I get the following:
> > >
> > > Kraken$ globus-job-run grid.nics.utk.edu:2119/jobmanager-pbs -count 12
> > > -m 1 -p UT-SUPPORT -d /lustre/scratch/victor -stdout
> > > /lustre/scratch/victor/globusjobrun.out -stderr
> > > /lustre/scratch/victor/globusjobrun.err /sw/altd/bin/aprun -n 12
> > > /lustre/scratch/victor/test1
> > >
> > > Which generates the following totally different job with bad
> > > arguments to the aprun command inside the scheduler_pbs_cmd_script,
> > > removing the ‘-n’ part of the argument:
> > >
> > > #! /bin/sh
> > > # PBS batch job script built by Globus job manager
> > > #
> > > #PBS -S /bin/sh
> > > #PBS -m n
> > > #PBS -A UT-SUPPORT
> > > #PBS -l walltime=1:00
> > > #PBS -o /lustre/scratch/victor/globusjobrun.out
> > > #PBS -e /lustre/scratch/victor/globusjobrun.err
> > > #PBS -l size=12
> > > X509_USER_PROXY="/nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/x509_user_proxy";
> > > export X509_USER_PROXY;
> > > GLOBUS_LOCATION="/nics/e/sw/teragrid/gram5-5.0.2-r1";
> > > export GLOBUS_LOCATION;
> > > GLOBUS_GRAM_JOB_CONTACT="https://grid.nics.utk.edu:50383/16145870889033404456/1367557268601595317/";
> > > export GLOBUS_GRAM_JOB_CONTACT;
> > > HOME="/nics/a/home/victor";
> > > export HOME;
> > > LOGNAME="victor";
> > > export LOGNAME;
> > > GLOBUS_GASS_CACHE_DEFAULT="/nics/a/home/victor/.globus/.gass_cache";
> > > export GLOBUS_GASS_CACHE_DEFAULT;
> > >
> > > #Change to directory requested by user
> > > cd /lustre/scratch/victor
> > >
> > > hosts=`cat $PBS_NODEFILE`;
> > > counter=0
> > > while test $counter -lt 12; do
> > >     for host in $hosts; do
> > >         if test $counter -lt 12; then
> > >             /usr/local/openssh/bin/ssh $host "/bin/sh /nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/scheduler_pbs_cmd_script; echo \$? > /nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/exit.$counter" < /dev/null &
> > >             counter=`expr $counter + 1`
> > >         else
> > >             break
> > >         fi
> > >     done
> > > done
> > > wait
> > >
> > > counter=0
> > > exit_code=0
> > > while test $counter -lt 12; do
> > >     /bin/touch /nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/exit.$counter;
> > >
> > >     read tmp_exit_code < /nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/exit.$counter
> > >     if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
> > >         exit_code=$tmp_exit_code
> > >     fi
> > >     counter=`expr $counter + 1`
> > > done
> > >
> > > exit $exit_code
> > >
> > > with scheduler_pbs_cmd_script being the following with INCORRECT
> > > arguments after the aprun:
> > >
> > > #!/bin/sh -l
> > > cd /lustre/scratch/victor
> > > X509_USER_PROXY="/nics/a/home/victor/.globus/job/grid.nics.utk.edu/16145870889033404456.1367557268601595317/x509_user_proxy";
> > > export X509_USER_PROXY;
> > > GLOBUS_LOCATION="/nics/e/sw/teragrid/gram5-5.0.2-r1";
> > > export GLOBUS_LOCATION;
> > > GLOBUS_GRAM_JOB_CONTACT="https://grid.nics.utk.edu:50383/16145870889033404456/1367557268601595317/";
> > > export GLOBUS_GRAM_JOB_CONTACT;
> > > HOME="/nics/a/home/victor";
> > > export HOME;
> > > LOGNAME="victor";
> > > export LOGNAME;
> > > GLOBUS_GASS_CACHE_DEFAULT="/nics/a/home/victor/.globus/.gass_cache";
> > > export GLOBUS_GASS_CACHE_DEFAULT;
> > >
> > > /sw/altd/bin/aprun 12 /lustre/scratch/victor/test1
> > > (note the -n 12 has been changed to just 12)
> > >
> > > Also when I do a globus-job-run -dumprsl I get essentially the same
> > > RSL as in the globusrun job above except the “-n” gets stripped off
> > > which is undesirable.
> > >
> > > vic...@krakenpf8(XT5):~/globustests> globus-job-run -dumprsl
> > > grid.nics.utk.edu:2119/jobmanager-pbs -count 12 -m 1 -p UT-SUPPORT
> > > -d /lustre/scratch/victor -stdout
> > > /lustre/scratch/victor/globusjobrun.out
> > > -stderr /lustre/scratch/victor/globusjobrun.err /sw/altd/bin/aprun
> > > "-n 12" /lustre/scratch/victor/test1
> > >
> > > &(executable="/sw/altd/bin/aprun")
> > > (project="UT-SUPPORT")
> > > (maxtime=1)
> > > (directory="/lustre/scratch/victor")
> > > (count=12)
> > > (arguments= "12" "/lustre/scratch/victor/test1")
> > > (stdout="/lustre/scratch/victor/globusjobrun.out")
> > > (stderr="/lustre/scratch/victor/globusjobrun.err")
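
A note on the quoting variants tried in this thread: the local shell splits the command line into words before globus-job-run ever sees it, so each quoting style hands it a different word list. The sketch below only illustrates that splitting step. show_args is a hypothetical helper written for this illustration, not part of Globus, and it says nothing about what globus-job-run does with the words afterwards.

```sh
#!/bin/sh
# show_args (hypothetical helper): print the number of words received
# and each word in brackets, so the shell's word splitting is visible.
show_args() {
    printf '%d args:' "$#"
    for a in "$@"; do
        printf ' [%s]' "$a"
    done
    printf '\n'
}

# Unquoted: "-n" and "12" arrive as two separate words.
show_args /sw/altd/bin/aprun -n 12 /lustre/scratch/victor/test1
# 4 args: [/sw/altd/bin/aprun] [-n] [12] [/lustre/scratch/victor/test1]

# Whole command quoted: the program path and its options become one word.
show_args "/sw/altd/bin/aprun -n 12" /lustre/scratch/victor/test1
# 2 args: [/sw/altd/bin/aprun -n 12] [/lustre/scratch/victor/test1]

# Only the option quoted: "-n 12" arrives as a single word.
show_args /sw/altd/bin/aprun "-n 12" /lustre/scratch/victor/test1
# 3 args: [/sw/altd/bin/aprun] [-n 12] [/lustre/scratch/victor/test1]
```

The second case appears consistent with the "executable does not exist" error reported above, since the whole quoted string would be taken as the program name. The third case shows that "-n 12" still reaches globus-job-run as a word beginning with "-n", so local quoting alone may not stop it from being read as an option.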
