Hi all,

I recently found a problem in an old run of a system, which I didn't 
pick up at the time because it failed silently. I'm sending this in the 
hope that someone else can learn from my mistake (and in case anyone has 
a suggestion for how best to catch it in future).

I was running the system on Subversion repository revision 3590, using 
the EMS, across a cluster. The cluster uses Torque rather than SGE, so 
the qsub commands are slightly different. In particular, I have to use 
the -old-sge flag.

During tuning, decoding was split into 10 parts. During run 6, it seems 
that one of the ten "best100" files was slow to appear, and so was not 
merged into run6.best100.out. The n-best translations from that split 
were then not available for the next round of tuning, leading the system 
to converge to a completely different (worse) point. The 1-best 
translations were all there, however, so no error was recorded.
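
One way to catch this after the fact would be to confirm that every 
input sentence has at least one entry in the merged n-best file. Below 
is a minimal sketch in Python (not the Perl of the Moses scripts). It 
assumes the usual Moses n-best layout, where each line starts with the 
sentence id followed by " ||| ". The function name is hypothetical, and 
a sentence the decoder genuinely could not translate may also show up as 
missing, so treat the result as a warning rather than proof of a lost 
split.

```python
def missing_nbest_ids(nbest_lines, num_sentences):
    """Return the sentence ids in [0, num_sentences) with no n-best entry.

    Hypothetical helper: assumes each non-empty line looks like
    "id ||| translation ||| features ||| score".
    """
    seen = set()
    for line in nbest_lines:
        line = line.strip()
        if not line:
            continue
        # The sentence id is the first "|||"-separated field.
        seen.add(int(line.split("|||", 1)[0]))
    return [i for i in range(num_sentences) if i not in seen]
```

Run over run6.best100.out with the size of the tuning set, a missing 
split should show up as a contiguous block of absent ids.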

If anyone needed any more convincing that you need to run your systems 
more than once, let this be an example.

My best guess at a fix is that the check_translation_old_sge method in 
scripts/generic/moses-parallel.pl needs to check that the n-best file 
has appeared for each split, not just the 1-best output. Does this 
sound right, or is there a better place for the check? (I haven't been 
following updates to Moses for a little while, so if this is all made 
redundant by recent changes, my apologies.)
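
To make the idea concrete, here is a rough sketch of the kind of check I 
mean, written in Python rather than the Perl of moses-parallel.pl. It is 
not the script's actual logic. The function name, the timeout and the 
polling interval are all assumptions, and the caller would have to 
supply the real per-split n-best file names. Note that "exists and is 
non-empty" does not prove a file is complete, only that it has started 
to appear.

```python
import os
import time


def wait_for_split_files(paths, timeout=60.0, poll=1.0):
    """Wait until every path exists and is non-empty.

    Returns the list of paths still missing or empty once the timeout
    expires (an empty list means every split file showed up).
    Hypothetical helper, not part of Moses.
    """
    deadline = time.monotonic() + timeout
    while True:
        missing = [
            p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0
        ]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll)
```

If the returned list is non-empty, the merge step should stop with an 
error instead of quietly building the combined n-best file from the 
splits that happened to be ready.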

Suzy

-- 
Suzy Howlett
http://www.showlett.id.au/
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
