Hi all,

I recently found a problem in an old run of a system, which I didn't pick up at the time because it failed silently. I'm sending this in the hope that someone else can learn from my mistake (and in case anyone has a suggestion for how best to catch it in future).
I was running the system on Subversion repository revision 3590, using the EMS, across a cluster. The cluster uses Torque rather than SGE, so the qsub commands are slightly different; in particular, I have to use the -old-sge flags.

During tuning, decoding was split into 10 parts. During run 6, it seems that one of the ten "best100" files was slow to appear and was not incorporated into run6.best100.out. Those translations were then not available for the next round of tuning, which led the system to converge to a completely different (worse) point. The 1-best translations were all there, however, so no error was recorded.

If anyone needed any more convincing that you need to run your systems more than once, let this be an example.

My best guess at a check for this is that the check_translation_old_sge method in scripts/generic/moses-parallel.pl needs to check that the n-best files have appeared for each split. Does this sound right, or is there a better place for the check? (I haven't been following updates to Moses for a little while, so if this is all made redundant by recent changes, my apologies.)

Suzy

--
Suzy Howlett
http://www.showlett.id.au/

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
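
P.S. To make the proposed check concrete, here is a rough Python sketch of the logic. It is not the Perl in moses-parallel.pl. The function names and the idea of passing in a list of per-split n-best paths are hypothetical and only there for illustration. It assumes the usual Moses n-best layout, where each line starts with the sentence id followed by " ||| ". The first function flags split files that are missing or empty. The other two look at an n-best file (for example the merged one) and report sentence ids that have no entry at all, which I would expect to have caught the run 6 case.

```python
import os


def missing_nbest_parts(nbest_paths):
    """Return the per-split n-best files that are absent or empty.

    nbest_paths is a hypothetical list of the file names the splits
    are expected to produce; the real naming scheme is not assumed here.
    """
    return [p for p in nbest_paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]


def nbest_sentence_ids(path):
    """Collect the sentence ids that occur in an n-best file.

    Assumes each non-blank line looks like
    'id ||| hypothesis ||| feature scores ||| total score'.
    """
    ids = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                ids.add(int(line.split("|||", 1)[0]))
    return ids


def sentences_without_nbest(nbest_path, n_sentences):
    """Return the ids in 0..n_sentences-1 that have no n-best entry."""
    expected = set(range(n_sentences))
    return sorted(expected - nbest_sentence_ids(nbest_path))
```

If either check returns a non-empty list, the run could be stopped (or the merge retried) instead of carrying on to the next tuning iteration. One caveat: a sentence the decoder genuinely could not translate might also produce no n-best entries, so the second check may need to be a warning rather than a hard failure.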
