Re: [Moses-support] a cautionary tale about tuning on a cluster

Suzy Howlett Tue, 06 Dec 2011 17:26:01 -0800

Hi Hieu,

I recognise that the usual argument for running the system multiple 
times is to counter random variation, but I believe that if I'd run the 
system more than once I'd have picked the error up at the time, instead 
of several months later. It's a different argument, but to the same end: 
a translation system is complex, and a single run may not be 
representative for a number of reasons, including random failure. As far 
as I know, the problem does not occur so frequently that the same error 
would have happened on a second or third run.


Regarding NFS, I had to check, and was told that yes, we're using NFS to 
mount volumes on the slave nodes, and was also told that there's 
suspicion that it's only partially working on our cluster. (Oh, yay.)
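
For filesystem delay of this kind, an explicit wait before reading each split's output can be sketched as below. This is only an illustration, not code from Moses: the function name `wait_for_stable_file` and its parameters are hypothetical, and treating an unchanged file size across two polls as "finished" is a heuristic, not a guarantee on NFS.

```python
import os
import time


def wait_for_stable_file(path, timeout=60.0, poll=1.0):
    """Return True once `path` is non-empty and its size is unchanged
    across two consecutive polls; return False if `timeout` expires.

    Hypothetical helper: size stability is only a heuristic for
    "the writer has finished", especially on a delayed filesystem.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not visible (yet)
        if size > 0 and size == last_size:
            return True
        last_size = size
        time.sleep(poll)
    return False
```

A caller would then report a loud error when the function returns False, rather than carrying on with whatever files happen to be present.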

The only bug (or quasi-bug) I suspect in Moses is that it did not 
confirm that all best100 files were there before proceeding to the next 
round of tuning; it just assumed that since the 1-best translations were 
there, then the 100-best translations should also be. It may not be a 
bug per se, since our cluster might just be a pathological case, but I 
think it could be made more robust.
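
As a rough illustration of the kind of check meant here, the sketch below scans a merged n-best file and reports which sentence ids have no entries at all. It assumes the usual Moses n-best layout, in which each line begins with a zero-based sentence id followed by `|||`; the function name `missing_nbest_ids` and the way the expected sentence count is supplied are hypothetical. This is not what moses-parallel.pl does.

```python
def missing_nbest_ids(nbest_path, num_sentences):
    """Return the sorted sentence ids in [0, num_sentences) that have
    no line in the n-best file at `nbest_path`.

    Assumes each non-blank line starts with "<sentence id> |||".
    """
    seen = set()
    with open(nbest_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            # The sentence id is the first "|||"-separated field.
            seen.add(int(line.split("|||", 1)[0].strip()))
    return sorted(set(range(num_sentences)) - seen)
```

If one of ten splits had gone missing from a file like run6.best100.out, this would return a contiguous block of ids, which could be turned into a hard failure before the next tuning iteration. Sentences that legitimately produce no translations would also be flagged, so the result would need a sanity check rather than blind trust.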

Suzy

On 7/12/11 2:40 AM, Hieu Hoang wrote:
> hi suzy
>
> this seems quite serious. Do you know if this occurs frequently? Are
> your files on NFS?
>
> There have been some similar problems reported when running on NFS
> because of filesystem delay. However, they usually cause errors which
> can be seen and dealt with quickly, e.g. by putting in extra waits or
> explicit checking for files. Your error isn't detected unless you trawl
> through the mert iterations.
>
> The need to rerun tuning is just to counter the random variability in
> mert, but this would be a definite bug that needs fixing.
>
> On 06/12/2011 07:23, Suzy Howlett wrote:
>> Hi all,
>>
>> I recently found a problem in an old run of a system, which I didn't
>> pick up at the time because it failed silently. I'm sending this in the
>> hope that someone else can learn from my mistake (and in case anyone has
>> a suggestion for how best to catch it in future).
>>
>> I was running the system on subversion repository revision 3590, using
>> the EMS, across a cluster. The cluster uses Torque rather than SGE so
>> the qsub commands are slightly different. In particular I have to use
>> the -old-sge flags.
>>
>> During tuning, decoding was split into 10 parts. During run 6, it seems
>> that one of the ten "best100" files was slow in appearing, and was not
>> incorporated into run6.best100.out. These translations were then not
>> available for the next round of tuning, leading the system to converge
>> to a completely different (worse) point. The 1-best translations were
>> all there, however, so no error was recorded.
>>
>> If anyone needed any more convincing that you need to run your systems
>> more than once, let this be an example.
>>
>> My best guess at a check for this is that the
>> scripts/generic/moses-parallel.pl check_translation_old_sge method needs
>> to check that the n-best files have appeared for each split. Does this
>> sound right, or is there a better place for the check? (I haven't been
>> following updates to Moses for a little while, so if this is all made
>> redundant by recent changes, my apologies.)
>>
>> Suzy
>>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

-- 
Suzy Howlett
http://www.showlett.id.au/
