Had a quick peek at the code. You're probably right: the best place to make the change is check_translation_old_sge(). There isn't a check for it in the latest code.
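For anyone who wants to try this, here is a rough sketch of the kind of check being discussed: wait a bounded time for every split's 1-best and n-best file to appear (to ride out filesystem delay), then fail loudly if any is still missing. It is written in Python for illustration, not in the Perl of moses-parallel.pl, and it is not taken from the Moses code. The function names, the file-name patterns, and the "non-empty means present" rule are all assumptions made for the example.

```python
import os
import time


def missing_outputs(paths):
    """Return the paths that do not exist yet or are still empty."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


def wait_for_outputs(paths, timeout=60.0, poll_interval=1.0):
    """Poll until every path exists and is non-empty, or the timeout expires.

    Returns the paths still missing; an empty list means all are present.
    """
    deadline = time.monotonic() + timeout
    missing = missing_outputs(paths)
    while missing and time.monotonic() < deadline:
        time.sleep(poll_interval)
        missing = missing_outputs(paths)
    return missing


def check_split_outputs(split_ids, one_best_pattern, n_best_pattern, **wait_args):
    """Require both the 1-best and the n-best file for every split.

    The patterns are hypothetical (e.g. "run6.split{}.trans"); they are
    NOT the names moses-parallel.pl actually uses.
    """
    expected = []
    for split_id in split_ids:
        expected.append(one_best_pattern.format(split_id))
        expected.append(n_best_pattern.format(split_id))
    missing = wait_for_outputs(expected, **wait_args)
    if missing:
        # Fail loudly instead of merging an incomplete n-best list.
        raise RuntimeError("missing or empty split outputs: " + ", ".join(missing))
```

A caller would run something like check_split_outputs(range(10), ...) before concatenating the per-split n-best lists. Note that a file being non-empty does not prove it is complete, so a stricter version might also compare the sentence ids in each n-best file against the split's input.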
I don't run Torque or an old SGE (or any grid regularly), so I'm not going to make the change, as I can't test it. If you do, it would be good to have your patch.

On 7 December 2011 08:24, Suzy Howlett <[email protected]> wrote:
> Hi Hieu,
>
> I recognise that the usual argument for running the system multiple
> times is to counter random variation, but I believe that if I'd run the
> system more than once I'd have picked the error up at the time, instead
> of several months later. It's a different argument, but to the same end:
> a translation system is complex, and a single run may not be
> representative for a number of reasons, including random failure. As far
> as I know, the problem does not occur so frequently that the same error
> would have happened on a second or third run.
>
> Regarding NFS, I had to check, and was told that yes, we're using NFS to
> mount volumes on the slave nodes, and was also told that there's
> suspicion that it's only partially working on our cluster. (Oh, yay.)
>
> The only bug (or quasi-bug) I suspect in Moses is that it did not
> confirm that all best100 files were there before proceeding to the next
> round of tuning; it just assumed that since the 1-best translations were
> there, then the 100-best translations should also be. It may not be a
> bug per se, since our cluster might just be a pathological case, but I
> think it could be made more robust.
>
> Suzy
>
> On 7/12/11 2:40 AM, Hieu Hoang wrote:
> > hi suzy
> >
> > this seems quite serious. Do you know if this occurs frequently? Are
> > your files on NFS?
> >
> > There have been some similar problems reported when running on NFS
> > because of filesystem delay. However, they usually cause errors which
> > can be seen and dealt with quickly, e.g. by putting in extra waits or
> > explicit checking for files. Your error isn't detected unless you trawl
> > through the mert iterations.
> >
> > The need to rerun tuning is just to counter the random variability in
> > mert but this would be a definite bug that needs fixing
> >
> > On 06/12/2011 07:23, Suzy Howlett wrote:
> >> Hi all,
> >>
> >> I recently found a problem in an old run of a system, which I didn't
> >> pick up at the time because it failed silently. I'm sending this in the
> >> hope that someone else can learn from my mistake (and in case anyone has
> >> a suggestion for how best to catch it in future).
> >>
> >> I was running the system on subversion repository revision 3590, using
> >> the EMS, across a cluster. The cluster uses Torque rather than SGE so
> >> the qsub commands are slightly different. In particular I have to use
> >> the -old-sge flags.
> >>
> >> During tuning, decoding was split into 10 parts. During run 6, it seems
> >> that one of the ten "best100" files was slow in appearing, and was not
> >> incorporated into run6.best100.out. These translations were then not
> >> available for the next round of tuning, leading the system to converge
> >> to a completely different (worse) point. The 1-best translations were
> >> all there, however, so no error was recorded.
> >>
> >> If anyone needed any more convincing that you need to run your systems
> >> more than once, let this be an example.
> >>
> >> My best guess at a check for this is that the
> >> scripts/generic/moses-parallel.pl check_translation_old_sge method needs
> >> to check that the n-best files have appeared for each split. Does this
> >> sound right, or is there a better place for the check? (I haven't been
> >> following updates to Moses for a little while, so if this is all made
> >> redundant by recent changes, my apologies.)
> >>
> >> Suzy
> >>
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Suzy Howlett
> http://www.showlett.id.au/
