Suzy,

Sorry for the delay. I must have missed your message over Christmas.

I've addressed two of your issues. The problem with printing "exit 1" was a typo: the closing " should have been before the newline, not after. The script now checks for PBS_ARRAYID and uses it if it is set and SGE_TASK_ID is not.
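For illustration, the two fixes amount to roughly the following in the generated job script. This is only a sketch of the idea, not the code that moses-parallel.pl actually emits; the function names resolve_task_id and require_task_id are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the array task ID, preferring SGE_TASK_ID
# and falling back to PBS_ARRAYID (TORQUE) when SGE_TASK_ID is not set.
resolve_task_id() {
    if [ -n "${SGE_TASK_ID:-}" ]; then
        echo "$SGE_TASK_ID"
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        echo "$PBS_ARRAYID"
    fi
}

# Hypothetical helper: print the task ID, or complain on stderr and
# return non-zero if the job was not submitted as an array job.
require_task_id() {
    task_id=$(resolve_task_id)
    if [ -z "$task_id" ]; then
        # The closing quote now comes before the newline, so the failure
        # is acted on instead of being printed as part of the message.
        echo "Job was not submitted as an array job" >&2
        return 1
    fi
    echo "$task_id"
}
```

A job script could then use `jobid=$(require_task_id) || exit 1` once near the top and refer to `$jobid` everywhere else, which keeps the scheduler-specific variable names in one place.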
I looked at your log file. I don't see anything wrong with how moses is being called. My best guess is that the version of moses you're using is different from the version the script is expecting, and one of the flags you're passing in your moses.ini file could be deprecated.

The latest version of the script is in my branch of moses. My suggestion would be to check out my branch, compile it, then try running moses-parallel.pl using that version of moses.

svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/branches/lane-syntax

As for the sync problem, I agree that is frustrating. The documentation you pointed to does make it sound like waiting on array jobs is supported by TORQUE, but if your implementation isn't accepting that flag, I don't know a good workaround. If you can get an updated version of TORQUE that accepts the afterokarray flag, I'd be happy to incorporate that.

Cheers,
Lane

On Fri, Dec 24, 2010 at 1:31 AM, Suzy Howlett <[email protected]> wrote:

> Hi again,
>
> I've now had a chance to try to fix my problems with the moses-parallel
> script.
>
> First, the equivalent to $SGE_TASK_ID in our system seems to be
> $PBS_ARRAYID. By making that substitution, I am able to get the array of
> jobs to submit to the queue. I get jobs with ids like "14708-1.draxx" (an
> ordinary job would not have the "-1") and job names like "job24557.bash-1".
>
> By the way, I suggest adding an extra line like
> $scriptheader.="jobid=$SGE_TASK_ID\n";
> before the first use of $SGE_TASK_ID, and using $jobid instead, so the
> change is only in one location.
>
> I think the second problem I noted below (about printing "exit 1") still
> holds.
>
> Now, although I can get the array of jobs to execute, I can't get the sync
> script to work. I had thought that the dependency should be listed using
> "afterokarray" instead of the "afterok" I had been using before (see
> http://www.clusterresources.com/torquedocs21/commands/qsub.shtml)... but
> our version of TORQUE isn't recognising that option! How frustrating. I'll
> see if we can get a more up-to-date version of TORQUE (or maybe switch to
> SGE?? I wish TORQUE had the "-sync" option...) but I don't think that will
> happen quickly.
>
> I think this is a failure from my end, unfortunately.
>
> Suzy
>
>
> On 23/12/10 9:19 AM, Suzy Howlett wrote:
>
>> Hi Lane,
>>
>> Sorry it's taken me a while to get back to you. I've attempted to run an
>> EMS experiment with your moses-parallel.pl, and (somewhat
>> unsurprisingly) it failed.
>>
>> For reference, my tuning/tmp.1 directory contains:
>> WR18184.W.log
>> filtered/
>> filterphrases.err
>> filterphrases.out
>> input.lc.1.split19899-aa to input.lc.1.split19899-aj
>> input.lc.1.split19899.trans (empty)
>> job19899.bash
>> job19899.bash.e14659-1 to job19899.bash.e14659-10
>> job19899.log
>> job19899.sync_workaround_script.sh
>> mert1.W.log (empty)
>> run1.moses.ini
>> run1.out (empty)
>> tmp19899/ (empty)
>>
>> I've attached a copy of job19899.bash and job19899.bash.e14659-1.
>>
>> First problem: I don't have an environment variable $SGE_TASK_ID. So,
>> it's always equal to "", and ${idxarray[$SGE_TASK_ID]} is also "".
>>
>> I'm also not sure whether I have a $TASK_ID. Is it an SGE variable?
>>
>> Second problem: This if statement:
>> if [ "" == "$SGE_TASK_ID" ]; then
>> echo "Job was not submitted as an array job
>> " exit 1
>> fi
>> For me, this prints "exit 1" instead of executing it. I assume that's
>> not what's supposed to happen?
>>
>> Third: the output job19899.bash.e14659-1 shows it didn't like the Moses
>> command in job19899.bash. Apart from the input file not existing
>> (because the affix from idxarray didn't work), can anyone spot what's
>> wrong? I can't figure it out.
>>
>> Suzy
>>
>>
>> On 18/12/10 1:55 AM, Lane Schwartz wrote:
>>
>>> I have a modified version of moses-parallel.pl that uses the qsub -t
>>> flag to submit child jobs as array jobs. I've verified that I get
>>> identical results using the modified version and the current version
>>> from trunk.
>>>
>>> Before I check this in, I would appreciate it if others could do a
>>> small test run to verify that the modified version works the same on
>>> their systems. Suzy, I'm especially interested in your feedback, since
>>> you're running Torque instead of SGE.
>>>
>>> From your perspective as a user, there is no change in how you call
>>> moses-parallel.pl. The changes that you should expect to see are:
>>>
>>> * When you look at your child jobs using qstat or qmon, they will all
>>> share the same job-ID, but will each have a unique ja-task-ID
>>>
>>> * Child jobs will all show up with the name MOSES, instead of MOSES-aa,
>>> MOSES-ab, etc. I tried to find a way to maintain the old naming format,
>>> but AFAIK there's no way to do that with array job submission
>>>
>>> * The temporary out.job* and err.job* files created during the run will
>>> end with numeric suffixes (corresponding to the child ja-task-ID)
>>> instead of the current alphabetic (-aa, -ab, -ac,...) suffixes. Again,
>>> I tried but was unable to maintain the old naming scheme.
>>>
>>> Thanks,
>>> Lane
>>>
>>> On Tue, Dec 14, 2010 at 4:07 PM, Lane Schwartz <[email protected]> wrote:
>>>
>>>     I was wondering if any consideration has been given to using qsub's
>>>     job array functionality in moses-parallel.pl.
>>>
>>>     Using the qsub -t flag, jobs can be tied together, so that if the
>>>     parent job is killed via qdel, all of the children are also killed.
>>>     Currently, if a parallel job needs to be killed, the children must
>>>     be manually deleted. This is OK if you only have one parallel job
>>>     running, but if you have many, and you haven't overridden the
>>>     default job name, things become hairier.
>>>
>>>     I would potentially be willing to make the change, but I wanted to
>>>     hear people's thoughts on the matter first.
>>>
>>>     Cheers,
>>>     Lane
>>>
>>>
>>> --
>>> When a place gets crowded enough to require ID's, social collapse is not
>>> far away. It is time to go elsewhere. The best thing about space travel
>>> is that it made it possible to go elsewhere.
>>> -- R.A. Heinlein, "Time Enough For Love"
>>>
>>
>>
> --
> Suzy Howlett
> http://www.showlett.id.au/
>

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
