Hi Lane,

In the end the failing call to moses was a red herring; it was fine after all. My problems ended up being checking for PBS_ARRAYID and our version of TORQUE not accepting array dependencies. Unfortunately, it's unlikely we'll get TORQUE updated quickly, so I won't be able to test it for you yet, but I expect once the array dependencies issue is resolved things will go smoothly.
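A minimal sketch of the task-ID check described above, assuming a POSIX shell job script. The function name resolve_task_id is hypothetical and is not taken from moses-parallel.pl; only the variable names SGE_TASK_ID and PBS_ARRAYID come from this thread.

```shell
# Sketch only: resolve_task_id is a hypothetical helper, not code from moses-parallel.pl.
# It prefers SGE's SGE_TASK_ID and falls back to TORQUE's PBS_ARRAYID
# when SGE_TASK_ID is unset or empty.
resolve_task_id() {
    if [ -n "${SGE_TASK_ID:-}" ]; then
        printf '%s\n' "$SGE_TASK_ID"
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        printf '%s\n' "$PBS_ARRAYID"
    else
        # Neither scheduler supplied an array index: report on stderr and fail.
        printf '%s\n' "Job was not submitted as an array job" >&2
        return 1
    fi
}

# Possible use inside a job script (left commented out so that sourcing
# this file never exits the calling shell):
# jobid=$(resolve_task_id) || exit 1
```

Resolving the index once and reusing a single variable afterwards keeps the scheduler-specific names in one place.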

I don't know of anyone else using Moses with TORQUE, so there is little need to incorporate PBS_ARRAYID or other TORQUE-specific code into your commit. I suggest sticking to SGE only, and I'll make any TORQUE-specific changes locally, as I do elsewhere in the code. On the other hand, if there are more people using cluster software other than SGE, it may be worth attempting to factor all the SGE-specific code from throughout Moses into one file, which can be substituted with a TORQUE version or whatever-else version to make a smoother configuration. I haven't checked how possible or easy that would be to do, though, so take that idea with a grain of salt.

Best,
Suzy

On 20/01/11 6:07 AM, Lane Schwartz wrote:
> Suzy,
>
> Sorry for the delay. I must have missed your message over Christmas.
>
> I've addressed two of your issues. The problem with printing exit 1 was
> a typo - the " should have been before the newline, not after. The
> script now checks for PBS_ARRAYID and uses it if it is set and
> SGE_TASK_ID is not.
>
> I looked at your log file. I don't see anything wrong with how moses is
> being called. My best guess is that the version of moses that you're
> using is different than the version of moses the script is expecting,
> and one of the flags you're passing in your moses.ini file could be
> deprecated.
>
> The latest version of the script is in my branch of moses. My suggestion
> would be to check out my branch, compile it, then try running
> moses-parallel.pl using that version of moses.
>
> svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/branches/lane-syntax
>
> As far as the sync problem, I agree that is frustrating. The
> documentation you pointed to does make it sound like waiting on array
> jobs is supported by TORQUE, but if your implementation isn't accepting
> that flag, I don't know a good workaround. If you can get an updated
> version of TORQUE that accepts the afterokarray flag, I'd be happy to
> incorporate that.
>
> Cheers,
> Lane
>
> On Fri, Dec 24, 2010 at 1:31 AM, Suzy Howlett <[email protected]> wrote:
>
> > Hi again,
> >
> > I've now had a chance to try to fix my problems with the
> > moses-parallel script.
> >
> > First, the equivalent to $SGE_TASK_ID in our system seems to be
> > $PBS_ARRAYID. By making that substitution, I am able to get the
> > array of jobs to submit to the queue. I get jobs with ids like
> > "14708-1.draxx" (an ordinary job would not have the "-1") and job
> > names like "job24557.bash-1".
> >
> > By the way, I suggest adding an extra line like
> > $scriptheader.="jobid=$SGE_TASK_ID\n";
> > before the first use of $SGE_TASK_ID, and using $jobid instead, so
> > the change is only in one location.
> >
> > I think the second problem I noted below (about printing "exit 1")
> > still holds.
> >
> > Now, although I can get the array of jobs to execute, I can't get
> > the sync script to work. I had thought that the dependency should be
> > listed using "afterokarray" instead of the "afterok" I had been
> > using before (see
> > http://www.clusterresources.com/torquedocs21/commands/qsub.shtml)...
> > but our version of TORQUE isn't recognising that option! How
> > frustrating. I'll see if we can get a more up-to-date version of
> > TORQUE (or maybe switch to SGE?? I wish TORQUE had the "-sync"
> > option...) but I don't think that will happen quickly.
> >
> > I think this is a failure from my end, unfortunately.
> >
> > Suzy
> >
> > On 23/12/10 9:19 AM, Suzy Howlett wrote:
> >
> > > Hi Lane,
> > >
> > > Sorry it's taken me a while to get back to you. I've attempted
> > > to run an EMS experiment with your moses-parallel.pl, and
> > > (somewhat unsurprisingly) it failed.
> > >
> > > For reference, my tuning/tmp.1 directory contains:
> > > WR18184.W.log
> > > filtered/
> > > filterphrases.err
> > > filterphrases.out
> > > input.lc.1.split19899-aa to input.lc.1.split19899-aj
> > > input.lc.1.split19899.trans (empty)
> > > job19899.bash
> > > job19899.bash.e14659-1 to job19899.bash.e14659-10
> > > job19899.log
> > > job19899.sync_workaround_script.sh
> > > mert1.W.log (empty)
> > > run1.moses.ini
> > > run1.out (empty)
> > > tmp19899/ (empty)
> > >
> > > I've attached a copy of job19899.bash and job19899.bash.e14659-1.
> > >
> > > First problem: I don't have an environment variable
> > > $SGE_TASK_ID. So, it's always equal to "", and
> > > ${idxarray[$SGE_TASK_ID]} is also "".
> > >
> > > I'm also not sure whether I have a $TASK_ID. Is it an SGE variable?
> > >
> > > Second problem: This if statement:
> > > if [ "" == "$SGE_TASK_ID" ]; then
> > > echo "Job was not submitted as an array job
> > > " exit 1
> > > fi
> > > For me, this prints "exit 1" instead of executing it. I assume
> > > that's not what's supposed to happen?
> > >
> > > Third: the output job19899.bash.e14659-1 shows it didn't like
> > > the Moses command in job19899.bash. Apart from the input file
> > > not existing (because the affix from idxarray didn't work), can
> > > anyone spot what's wrong? I can't figure it out.
> > >
> > > Suzy
> > >
> > > On 18/12/10 1:55 AM, Lane Schwartz wrote:
> > >
> > > > I have a modified version of moses-parallel.pl that uses the
> > > > qsub -t flag to submit child jobs as array jobs. I've verified
> > > > that I get identical results using the modified version and
> > > > the current version from trunk.
> > > >
> > > > Before I check this in, I would appreciate it if others could
> > > > do a small test run to verify that the modified version works
> > > > the same on their systems. Suzy, I'm especially interested in
> > > > your feedback, since you're running Torque instead of SGE.
> > > >
> > > > From your perspective as a user, there is no change in how
> > > > you call moses-parallel.pl. The changes that you should expect
> > > > to see are:
> > > >
> > > > * When you look at your child jobs using qstat or qmon, they
> > > >   will all share the same job-ID, but will each have a unique
> > > >   ja-task-ID
> > > > * Child jobs will all show up with the name MOSES, instead of
> > > >   MOSES-aa, MOSES-ab, etc. I tried to find a way to maintain
> > > >   the old naming format, but AFAIK there's no way to do that
> > > >   with array job submission
> > > > * The temporary out.job* and err.job* files created during the
> > > >   run will end with numeric suffixes (corresponding to the
> > > >   child ja-task-ID) instead of the current alphabetic (-aa,
> > > >   -ab, -ac,...) suffixes. Again, I tried but was unable to
> > > >   maintain the old naming scheme.
> > > >
> > > > Thanks,
> > > > Lane
> > > >
> > > > On Tue, Dec 14, 2010 at 4:07 PM, Lane Schwartz
> > > > <[email protected]> wrote:
> > > >
> > > > > I was wondering if any consideration has been given to
> > > > > using qsub's job array functionality in moses-parallel.pl.
> > > > >
> > > > > Using the qsub -t flag, jobs can be tied together, so that
> > > > > if the parent job is killed via qdel, all of the children
> > > > > are also killed. Currently, if a parallel job needs to be
> > > > > killed, the children must be manually deleted. This is OK
> > > > > if you only have one parallel job running, but if you have
> > > > > many, and you haven't overridden the default job name,
> > > > > things become hairier.
> > > > >
> > > > > I would potentially be willing to make the change, but I
> > > > > wanted to hear people's thoughts on the matter first.
> > > > >
> > > > > Cheers,
> > > > > Lane
> > > >
> > > > --
> > > > When a place gets crowded enough to require ID's, social
> > > > collapse is not far away. It is time to go elsewhere. The best
> > > > thing about space travel is that it made it possible to go
> > > > elsewhere.
> > > > -- R.A. Heinlein, "Time Enough For Love"
> >
> > --
> > Suzy Howlett
> > http://www.showlett.id.au/
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"

--
Suzy Howlett
http://www.showlett.id.au/
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
