Re: [Moses-support] qsub and EMS again

Miles Osborne Thu, 02 Sep 2010 23:33:59 -0700

Yes, skipping the check during the planning stage seems sensible.
(Alternatively, you could just shorten the delay at that point to speed
things up.)
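
For illustration, here is a minimal sketch of the polling loop proposed in the quoted message below, with the wait skipped during planning. It is written in Python rather than the Perl of experiment.perl and is not the EMS implementation. The function names, the `planning` flag, the default timeout and interval, and the `<prefix>.<SUFFIX>` file naming are all assumptions made for the example.

```python
import os
import time


def wait_for_step_files(prefix, timeout=300.0, interval=5.0,
                        suffixes=("STDERR", "STDOUT", "DONE"),
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until every '<prefix>.<suffix>' file exists or the timeout elapses.

    Returns True if all files were found, False if the overall limit was hit.
    The file naming scheme is an assumption for this sketch.
    """
    deadline = clock() + timeout
    while True:
        missing = [s for s in suffixes
                   if not os.path.exists(f"{prefix}.{s}")]
        if not missing:
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Overall limit reached: stop instead of looping forever.
            return False
        # Sleep, but never past the deadline.
        sleep(min(interval, remaining))


def check_if_crashed(prefix, planning=False):
    """Hypothetical stand-in for the EMS check: a step counts as crashed
    if its files never show up.

    During planning the files are checked once and not waited for, so a
    step that really did crash cannot stall the planner.
    """
    timeout = 0.0 if planning else 300.0
    return not wait_for_step_files(prefix, timeout=timeout)
```

Calling `check_if_crashed(prefix, planning=True)` checks once and returns immediately, while the default call waits up to the (assumed) five-minute limit for slow STDERR and STDOUT files to appear.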


Here in Edinburgh we use experiment.perl mainly in a multicore /
single-machine setting, which is why support for slow STDERR
creation is not really there yet. There are plans to port this
to Hadoop, though, which should solve synchronisation problems like this.
That is the next major piece of development I'll be involved with.
(The current one involves more language modelling.)

Miles

On 3 September 2010 01:18, Suzy Howlett wrote:
> Thanks for the responses. I think I will go with the loop. I was a bit
> confused about this at first - it considers the step to have crashed if the
> STDERR file does not exist, but since the STDERR file is the output of the
> script that creates the DONE file, I would have thought that the DONE file
> could not be created without the STDERR file ultimately following. However,
> presumably if the STDERR file didn't appear for some reason, that is a
> problem, and so should be considered a crash.
>
> The unfortunate thing about putting a loop like this in check_if_crashed is
> that it also has to go through this when it's planning what steps to do,
> which could lead to a long delay in planning if a step has actually crashed
> through not creating a STDERR file.
>
> I think the problem is ultimately with our cluster. I noticed sometimes some
> jobs would be sitting on the queue with status "exiting" for several minutes
> - so the DONE file had been created but the STDERR file would not appear
> until after the job had been finally removed from the queue. Having given it
> some more thought, I think the issue may be with writing to disk. I'm pretty
> sure that the slave nodes do not have their own hard disks, only the master,
> and I think jobs may have been stalled while they waited for a chance to
> write results to disk - the master node was very very busy at the time. I
> don't know if that accounts for it! I'm not sure how there being no hard
> disks in the slaves interacts with Hieu's point - I don't really understand
> how the setup works.
>
> Thanks again,
> Suzy
>
> On 2/09/10 8:26 PM, Miles Osborne wrote:
>>
>> a better setup would be to have a loop which did the following:
>>
>> --for a given version number and step, check for STDERR, STDOUT and DONE
>> --if they are all found, exit
>> --otherwise sleep and recheck
>>
>> (and put some limit overall to prevent an endless loop)
>>
>> Miles
>>
>> On 2 September 2010 11:16, Hieu Hoang wrote:
>>>
>>> Sounds like a bad case of a network file system. You probably need to
>>> harass your sysadmin, and try a few of these too:
>>>    http://fixunix.com/nfs/61890-forcing-nfs-sync.html
>>>
>>> On 02/09/2010 04:09, Suzy Howlett wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm running Moses through its experiment management system across a
>>>> cluster and I'm finding that sometimes jobs will finish successfully but
>>>> the .STDERR and .STDOUT files will be slow in appearing relative to the
>>>> .DONE file, meaning that the EMS concludes that the step crashed. I can
>>>> run the system again and it successfully reuses the results of the step
>>>> (it doesn't have to rerun the step) but this is becoming frustrating as
>>>> I have to restart the system frequently. I tried adding a call to
>>>> sleep() in the check_if_crashed() method in experiment.perl but this is
>>>> not helping in general - I think sometimes the delay is as much as a
>>>> couple of minutes.
>>>>
>>>> Has anyone else faced this problem, or does anyone have a better idea
>>>> for how to get around it?
>>>>
>>>> Cheers,
>>>> Suzy
>>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>>
>>
>
> --
> Suzy Howlett
> http://www.showlett.id.au/
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

