I had a quick look at the code. You're probably right: the best place to make
the change is
   check_translation_old_sge()
There isn't a check for it in the latest code.

I don't run Torque or an old SGE (or any grid regularly), so I'm not going to
make the change, as I can't test it. If you do, it would be good to have your
patch.
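
For what it's worth, the kind of check being discussed (wait a bit for slow
files, then explicitly confirm that every split's n-best file is there) could
look roughly like the sketch below. It is Python for illustration only, not
the Perl in moses-parallel.pl. The function names, the timeout and poll
values, and the "non-empty" test are all assumptions, not anything taken from
the Moses code.

```python
import os
import time


def missing_nbest_splits(paths):
    """Return the paths that do not exist as files or are empty."""
    return [p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]


def wait_for_nbest_splits(paths, timeout=60.0, poll=1.0):
    """Poll until every split file is present and non-empty, or time out.

    Returns the paths still missing when the wait ends (empty on success).
    The timeout and poll defaults are arbitrary illustrative values.
    """
    deadline = time.monotonic() + timeout
    missing = missing_nbest_splits(paths)
    while missing and time.monotonic() < deadline:
        time.sleep(poll)  # give a slow filesystem (e.g. NFS) time to catch up
        missing = missing_nbest_splits(paths)
    return missing
```

The caller would pass the list of per-split n-best file names (whatever the
script actually calls them) and abort loudly if the returned list is
non-empty, instead of concatenating what happens to be there. Note that a
non-empty file can still be only partially written, so this would catch a
missing split but not a truncated one.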


On 7 December 2011 08:24, Suzy Howlett <[email protected]> wrote:

> Hi Hieu,
>
> I recognise that the usual argument for running the system multiple
> times is to counter random variation, but I believe that if I'd run the
> system more than once I'd have picked the error up at the time, instead
> of several months later. It's a different argument, but to the same end:
> a translation system is complex, and a single run may not be
> representative for a number of reasons, including random failure. As far
> as I know, the problem does not occur so frequently that the same error
> would have happened on a second or third run.
>
> Regarding NFS, I had to check, and was told that yes, we're using NFS to
> mount volumes on the slave nodes, and was also told that there's
> suspicion that it's only partially working on our cluster. (Oh, yay.)
>
> The only bug (or quasi-bug) I suspect in Moses is that it did not
> confirm that all best100 files were there before proceeding to the next
> round of tuning; it just assumed that since the 1-best translations were
> there, then the 100-best translations should also be. It may not be a
> bug per se, since our cluster might just be a pathological case, but I
> think it could be made more robust.
>
> Suzy
>
> On 7/12/11 2:40 AM, Hieu Hoang wrote:
> > hi suzy
> >
> > this seems quite serious. Do you know if this occurs frequently? Are
> > your files on NFS?
> >
> > There have been some similar problems reported when running on NFS
> > because of filesystem delay. However, they usually cause errors which
> > can be seen and dealt with quickly, e.g. by putting in extra waits or
> > explicit checking for files. Your error isn't detected unless you trawl
> > through the mert iterations.
> >
> > The need to rerun tuning is just to counter the random variability in
> > mert, but this would be a definite bug that needs fixing.
> >
> > On 06/12/2011 07:23, Suzy Howlett wrote:
> >> Hi all,
> >>
> >> I recently found a problem in an old run of a system, which I didn't
> >> pick up at the time because it failed silently. I'm sending this in the
> >> hope that someone else can learn from my mistake (and in case anyone has
> >> a suggestion for how best to catch it in future).
> >>
> >> I was running the system on subversion repository revision 3590, using
> >> the EMS, across a cluster. The cluster uses Torque rather than SGE so
> >> the qsub commands are slightly different. In particular I have to use
> >> the -old-sge flags.
> >>
> >> During tuning, decoding was split into 10 parts. During run 6, it seems
> >> that one of the ten "best100" files was slow in appearing, and was not
> >> incorporated into run6.best100.out. These translations were then not
> >> available for the next round of tuning, leading the system to converge
> >> to a completely different (worse) point. The 1-best translations were
> >> all there, however, so no error was recorded.
> >>
> >> If anyone needed any more convincing that you need to run your systems
> >> more than once, let this be an example.
> >>
> >> My best guess at a check for this is that the
> >> scripts/generic/moses-parallel.pl check_translation_old_sge method needs
> >> to check that the n-best files have appeared for each split. Does this
> >> sound right, or is there a better place for the check? (I haven't been
> >> following updates to Moses for a little while, so if this is all made
> >> redundant by recent changes, my apologies.)
> >>
> >> Suzy
> >>
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Suzy Howlett
> http://www.showlett.id.au/