Hi Lane,

In the end, the failing call to moses was a red herring; it was fine 
after all. My problems turned out to be the check for PBS_ARRAYID and 
our version of TORQUE not accepting array dependencies. Unfortunately, 
it's unlikely we'll get TORQUE updated quickly, so I can't test the 
script for you yet, but I expect things will go smoothly once the 
array-dependencies issue is resolved.
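For anyone else on TORQUE, the local change I have in mind is roughly the following fallback in the generated job script (the resolve_task_id name and exact wording are mine, not code from moses-parallel.pl):

```shell
resolve_task_id() {
    # Prefer SGE's array-task variable, fall back to TORQUE's.
    if [ -n "${SGE_TASK_ID:-}" ]; then
        echo "$SGE_TASK_ID"
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        echo "$PBS_ARRAYID"
    else
        # Neither scheduler set a task ID: not an array job.
        echo "Job was not submitted as an array job" >&2
        return 1
    fi
}
```

The generated script could then use `jobid=$(resolve_task_id) || exit 1` once, instead of reading $SGE_TASK_ID in several places.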

I don't know of anyone else using Moses with TORQUE, so there is little 
need to incorporate PBS_ARRAYID or other TORQUE-specific code into your 
commit. I suggest sticking to SGE only, and I'll make any 
TORQUE-specific changes locally, as I do elsewhere in the code.

On the other hand, if more people are using cluster software other than 
SGE, it may be worth trying to factor all the SGE-specific code 
scattered throughout Moses into one file, which could then be swapped 
for a TORQUE (or other scheduler) version to make configuration 
smoother. I haven't checked how feasible that would be, though, so take 
the idea with a grain of salt.
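As a rough illustration of what I mean (all of these file and function names are hypothetical, not existing Moses code), each scheduler file could expose the same small interface and Moses would source whichever one matches the local cluster:

```shell
# sge.sh -- the SGE flavour of the hypothetical shim
array_submit_flags() { echo "-t 1-$1"; }        # qsub -t 1-N submits an array of N tasks
array_dep_flags()    { echo "-hold_jid $1"; }   # make a job wait on job $1

# torque.sh would define the same two names with TORQUE flags, e.g.
# array_submit_flags() { echo "-t 1-$1"; }
# array_dep_flags()    { echo "-W depend=afterokarray:$1"; }
```

Then the rest of the code would only ever call array_submit_flags and array_dep_flags, and the SGE/TORQUE differences would live in one place.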

Best,
Suzy

On 20/01/11 6:07 AM, Lane Schwartz wrote:
> Suzy,
> Sorry for the delay. I must have missed your message over Christmas.
> I've addressed two of your issues. The problem with printing exit 1 was
> a typo - the " should have been before the newline, not after. The
> script now checks for PBS_ARRAYID and uses it if it is set and
> SGE_TASK_ID is not.
> I looked at your log file. I don't see anything wrong with how moses is
> being called. My best guess is that the version of moses that you're
> using is different than the version of moses the script is expecting,
> and one of the flags you're passing in your moses.ini file could be
> deprecated.
> The latest version of the script is in my branch of moses. My suggestion
> would be to check out my branch, compile it, then try running
> moses-parallel.pl using that version of moses.
> svn co
> https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/branches/lane-syntax
> As far as the sync problem, I agree that is frustrating. The
> documentation you pointed to does make it sound like waiting on array
> jobs is supported by TORQUE, but if your implementation isn't accepting
> that flag, I don't know a good workaround. If you can get an updated
> version of TORQUE that accepts the afterokarray flag, I'd be happy to
> incorporate that.
> Cheers,
> Lane
>
> On Fri, Dec 24, 2010 at 1:31 AM, Suzy Howlett <[email protected]> wrote:
>
>     Hi again,
>
>     I've now had a chance to try to fix my problems with the
>     moses-parallel script.
>
>     First, the equivalent to $SGE_TASK_ID in our system seems to be
>     $PBS_ARRAYID. By making that substitution, I am able to get the
>     array of jobs to submit to the queue. I get jobs with ids like
>     "14708-1.draxx" (an ordinary job would not have the "-1") and job
>     names like "job24557.bash-1".
>
>     By the way, I suggest adding an extra line like
>          $scriptheader .= "jobid=\$SGE_TASK_ID\n";
>     before the first use of $SGE_TASK_ID, and using $jobid instead, so
>     the change is only in one location.
>
>     I think the second problem I noted below (about printing "exit 1")
>     still holds.
>
>     Now, although I can get the array of jobs to execute, I can't get
>     the sync script to work. I had thought that the dependency should be
>     listed using "afterokarray" instead of the "afterok" I had been
>     using before (see
>     http://www.clusterresources.com/torquedocs21/commands/qsub.shtml)...
>     but our version of TORQUE isn't recognising that option! How
>     frustrating. I'll see if we can get a more up-to-date version of
>     TORQUE (or maybe switch to SGE?? I wish TORQUE had the "-sync"
>     option...) but I don't think that will happen quickly.
>
>     I think this is a failure from my end, unfortunately.
>
>     Suzy
>
>
>     On 23/12/10 9:19 AM, Suzy Howlett wrote:
>
>         Hi Lane,
>
>         Sorry it's taken me a while to get back to you. I've attempted
>         to run an
>         EMS experiment with your moses-parallel.pl, and (somewhat
>         unsurprisingly) it failed.
>
>         For reference, my tuning/tmp.1 directory contains:
>         WR18184.W.log
>         filtered/
>         filterphrases.err
>         filterphrases.out
>         input.lc.1.split19899-aa to input.lc.1.split19899-aj
>         input.lc.1.split19899.trans (empty)
>         job19899.bash
>         job19899.bash.e14659-1 to job19899.bash.e14659-10
>         job19899.log
>         job19899.sync_workaround_script.sh
>         mert1.W.log (empty)
>         run1.moses.ini
>         run1.out (empty)
>         tmp19899/ (empty)
>
>         I've attached a copy of job19899.bash and job19899.bash.e14659-1.
>
>         First problem: I don't have an environment variable
>         $SGE_TASK_ID. So,
>         it's always equal to "", and ${idxarray[$SGE_TASK_ID]} is also "".
>
>         I'm also not sure whether I have a $TASK_ID. Is it an SGE variable?
>
>         Second problem: This if statement:
>         if [ "" == "$SGE_TASK_ID" ]; then
>         echo "Job was not submitted as an array job
>         " exit 1
>         fi
>         For me, this prints "exit 1" instead of executing it. I assume
>         that's
>         not what's supposed to happen?
>
>         Third: the output job19899.bash.e14659-1 shows it didn't like
>         the Moses
>         command in job19899.bash. Apart from the input file not existing
>         (because the affix from idxarray didn't work), can anyone spot
>         what's
>         wrong? I can't figure it out.
>
>         Suzy
>
>
>         On 18/12/10 1:55 AM, Lane Schwartz wrote:
>
>             I have a modified version of moses-parallel.pl that
>             uses the qsub -t flag to submit child
>             jobs as array jobs. I've verified that I get identical
>             results using the
>             modified version and the current version from trunk.
>             Before I check this in, I would appreciate it if others could
>             do a small
>             test run to verify that the modified version works the same
>             on their
>             systems. Suzy, I'm especially interested in your feedback,
>             since you're
>             running Torque instead of SGE.
>             From your perspective as a user, there is no change in how
>             you call
>             moses-parallel.pl. The changes that you
>             should expect to see are:
>             * When you look at your child jobs using qstat or qmon, they
>             will all
>             share the same job-ID, but will each have a unique ja-task-ID
>             * Child jobs will all show up with the name MOSES, instead
>             of MOSES-aa,
>             MOSES-ab, etc. I tried to find a way to maintain the old
>             naming format,
>             but AFAIK there's no way to do that with array job submission
>             * The temporary out.job* and err.job* files created during
>             the run will
>             end with numeric suffixes (corresponding to the child
>             ja-task-ID)
>             instead of the current alphabetic (-aa, -ab, -ac,...)
>             suffixes. Again,
>             I tried but was unable to maintain the old naming scheme.
>             Thanks,
>             Lane
>
>             On Tue, Dec 14, 2010 at 4:07 PM, Lane Schwartz
>             <[email protected] <mailto:[email protected]>
>             <mailto:[email protected] <mailto:[email protected]>>> wrote:
>
>             I was wondering if any consideration has been given to using
>             qsub's
>             job array functionality in moses-parallel.pl.
>             Using the qsub -t flag, jobs can be tied together, so that
>             if the
>             parent job is killed via qdel, all of the children are also
>             killed.
>             Currently, if a parallel job needs to be killed, the
>             children must
>             be manually deleted. This is OK if you only have one
>             parallel job
>             running, but if you have many, and you haven't overridden the
>             default job name, things become hairier.
>             I would potentially be willing to make the change, but I
>             wanted to
>             hear people's thoughts on the matter first.
>             Cheers,
>             Lane
>
>
>
>
>     --
>     Suzy Howlett
>     http://www.showlett.id.au/
>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                  -- R.A. Heinlein, "Time Enough For Love"

-- 
Suzy Howlett
http://www.showlett.id.au/