Hi again,
I've now had a chance to try to fix my problems with the moses-parallel
script.
First, the equivalent to $SGE_TASK_ID in our system seems to be
$PBS_ARRAYID. By making that substitution, I am able to get the array of
jobs to submit to the queue. I get jobs with ids like "14708-1.draxx"
(an ordinary job would not have the "-1") and job names like
"job24557.bash-1".
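By the way, I suggest adding an extra line like
$scriptheader.="jobid=\$SGE_TASK_ID\n";
(with the $ escaped so that Perl writes it into the job script literally)
before the first use of $SGE_TASK_ID, and using $jobid from then on, so
that the scheduler-specific variable appears in only one place.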
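In case it helps, here is a rough sketch of what the generated job script
could do with that single assignment. It is only an illustration:
resolve_task_id is a name I made up, and falling back from $SGE_TASK_ID to
$PBS_ARRAYID is my assumption about how to cover both SGE and our TORQUE.

```shell
# Sketch only: resolve the array index once, whichever scheduler set it.
# resolve_task_id is a hypothetical helper, not part of moses-parallel.pl.
resolve_task_id() {
    # Print SGE_TASK_ID if it is non-empty, otherwise PBS_ARRAYID,
    # otherwise an empty line.
    printf '%s\n' "${SGE_TASK_ID:-${PBS_ARRAYID:-}}"
}

# The rest of the job script would refer to $jobid only.
jobid=$(resolve_task_id)
```

With this in place, a change of scheduler would only touch the helper.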
I think the second problem I noted below (about printing "exit 1") still
holds.
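As far as I can tell, the closing quote of the echo ends up at the start
of the next line, so "exit 1" is handed to echo as extra arguments instead
of being run. A minimal sketch of how I assume the check was meant to read
is below; check_array_job is a hypothetical wrapper (using return instead
of exit) so that it can be tried outside the queue.

```shell
# Sketch only: check_array_job is a hypothetical name for illustration.
check_array_job() {
    # $1 is the array task id (e.g. the value of $SGE_TASK_ID); an empty
    # value means the job was not started as part of an array.
    if [ -z "$1" ]; then
        # The quote is closed on the same line as the message, so the
        # following statement is executed rather than printed.
        echo "Job was not submitted as an array job"
        return 1   # the generated job script would use "exit 1" here
    fi
    return 0
}
```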
Now, although I can get the array of jobs to execute, I can't get the
sync script to work. I had thought that the dependency should be listed
using "afterokarray" instead of the "afterok" I had been using before
(see
http://www.clusterresources.com/torquedocs21/commands/qsub.shtml)... but
our version of TORQUE isn't recognising that option! How frustrating.
I'll see if we can get a more up-to-date version of TORQUE (or maybe
switch to SGE?? I wish TORQUE had the "-sync" option...) but I don't
think that will happen quickly.
Unfortunately, I think this one is a limitation on my end (our TORQUE
version) rather than a problem with the script.
Suzy
On 23/12/10 9:19 AM, Suzy Howlett wrote:
> Hi Lane,
>
> Sorry it's taken me a while to get back to you. I've attempted to run an
> EMS experiment with your moses-parallel.pl, and (somewhat
> unsurprisingly) it failed.
>
> For reference, my tuning/tmp.1 directory contains:
> WR18184.W.log
> filtered/
> filterphrases.err
> filterphrases.out
> input.lc.1.split19899-aa to input.lc.1.split19899-aj
> input.lc.1.split19899.trans (empty)
> job19899.bash
> job19899.bash.e14659-1 to job19899.bash.e14659-10
> job19899.log
> job19899.sync_workaround_script.sh
> mert1.W.log (empty)
> run1.moses.ini
> run1.out (empty)
> tmp19899/ (empty)
>
> I've attached a copy of job19899.bash and job19899.bash.e14659-1.
>
> First problem: I don't have an environment variable $SGE_TASK_ID. So,
> it's always equal to "", and ${idxarray[$SGE_TASK_ID]} is also "".
>
> I'm also not sure whether I have a $TASK_ID. Is it an SGE variable?
>
> Second problem: This if statement:
> if [ "" == "$SGE_TASK_ID" ]; then
> echo "Job was not submitted as an array job
> " exit 1
> fi
> For me, this prints "exit 1" instead of executing it. I assume that's
> not what's supposed to happen?
>
> Third: the output job19899.bash.e14659-1 shows it didn't like the Moses
> command in job19899.bash. Apart from the input file not existing
> (because the affix from idxarray didn't work), can anyone spot what's
> wrong? I can't figure it out.
>
> Suzy
>
>
> On 18/12/10 1:55 AM, Lane Schwartz wrote:
>> I have a modified version of moses-parallel.pl that uses the qsub -t flag
>> to submit child jobs as array jobs. I've verified that I get identical
>> results using the modified version and the current version from trunk.
>> Before I check this in, I would appreciate it if others could do a small
>> test run to verify that the modified version works the same on their
>> systems. Suzy, I'm especially interested in your feedback, since you're
>> running Torque instead of SGE.
>> From your perspective as a user, there is no change in how you call
>> moses-parallel.pl. The changes that you should expect to see are:
>> * When you look at your child jobs using qstat or qmon, they will all
>> share the same job-ID, but will each have a unique ja-task-ID
>> * Child jobs will all show up with the name MOSES, instead of MOSES-aa,
>> MOSES-ab, etc. I tried to find a way to maintain the old naming format,
>> but AFAIK there's no way to do that with array job submission
>> * The temporary out.job* and err.job* files created during the run will
>> end with numeric suffixes (corresponding to the child ja-task-ID)
>> instead of the current alphabetic (-aa, -ab, -ac,...) suffixes. Again,
>> I tried but was unable to maintain the old naming scheme.
>> Thanks,
>> Lane
>>
>> On Tue, Dec 14, 2010 at 4:07 PM, Lane Schwartz wrote:
>>
>> I was wondering if any consideration has been given to using qsub's
>> job array functionality in moses-parallel.pl.
>> Using the qsub -t flag, jobs can be tied together, so that if the
>> parent job is killed via qdel, all of the children are also killed.
>> Currently, if a parallel job needs to be killed, the children must
>> be manually deleted. This is OK if you only have one parallel job
>> running, but if you have many, and you haven't overridden the
>> default job name, things become hairier.
>> I would potentially be willing to make the change, but I wanted to
>> hear people's thoughts on the matter first.
>> Cheers,
>> Lane
>>
>>
>>
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away. It is time to go elsewhere. The best thing about space travel
>> is that it made it possible to go elsewhere.
>> -- R.A. Heinlein, "Time Enough For Love"
>
--
Suzy Howlett
http://www.showlett.id.au/
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support