Thanks for the responses. I think I will go with the loop. I was a bit 
confused about this at first - it considers the step to have crashed if 
the STDERR file does not exist, but since the STDERR file is the output 
of the script that creates the DONE file, I would have thought that the 
DONE file could not be created without the STDERR file ultimately 
following. However, presumably if the STDERR file never appears at all, 
that is a problem in itself, and so should be considered a crash.

The unfortunate thing about putting a loop like this in check_if_crashed 
is that the EMS also has to run through it when it's planning which steps 
to do, which could lead to a long delay in planning if a step really has 
crashed without creating a STDERR file.
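For the record, here is a rough sketch (in shell, for illustration) of the 
sort of loop Miles suggested below. The function name, file-name pattern, 
poll interval and 60-second default limit are just my guesses, not anything 
taken from experiment.perl:

```shell
# Hypothetical sketch of the wait loop; the file-name pattern, poll
# interval and default timeout are assumptions, not experiment.perl code.
wait_for_step_files() {
  prefix="$1"          # e.g. steps/2/TRAINING_run-giza.3
  limit="${2:-60}"     # give up after this many seconds
  waited=0
  while [ "$waited" -lt "$limit" ]; do
    if [ -e "$prefix.STDERR" ] && [ -e "$prefix.STDOUT" ] && [ -e "$prefix.DONE" ]; then
      return 0         # all three files visible: the step finished cleanly
    fi
    sleep 1            # NFS may just be slow to show the files; wait and recheck
    waited=$((waited + 1))
  done
  return 1             # files never appeared: treat the step as crashed
}
```

The overall limit is what stops a genuinely crashed step (one that will 
never write a STDERR file) from stalling the planning phase forever.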

I think the problem is ultimately with our cluster. I noticed that 
sometimes jobs would sit on the queue with status "exiting" for several 
minutes - the DONE file had been created, but the STDERR file would not 
appear until after the job had finally been removed from the queue. 
Having given it some more thought, I think the issue may be with writing 
to disk. I'm pretty sure that the slave nodes do not have their own hard 
disks, only the master, and I think jobs may have been stalled while they 
waited for a chance to write results to disk - the master node was very, 
very busy at the time. I don't know if that accounts for it! I'm also not 
sure how the slaves having no hard disks interacts with Hieu's point - I 
don't really understand how the setup works.

Thanks again,
Suzy

On 2/09/10 8:26 PM, Miles Osborne wrote:
> a better setup would be to have a loop which did the following:
>
> --for a given version number and step, check for STDERR, STDOUT and DONE
> --if they are all found, exit
> --otherwise sleep and recheck
>
> (and put some limit overall to prevent an endless loop)
>
> Miles
>
> On 2 September 2010 11:16, Hieu Hoang <[email protected]> wrote:
>>   sounds like a bad case of a network file system. you prob need to
>> harass your sysadmin and try a few of these too
>>     http://fixunix.com/nfs/61890-forcing-nfs-sync.html
>>
>> On 02/09/2010 04:09, Suzy Howlett wrote:
>>> Hi everyone,
>>>
>>> I'm running Moses through its experiment management system across a
>>> cluster and I'm finding that sometimes jobs will finish successfully but
>>> the .STDERR and .STDOUT files will be slow in appearing relative to the
>>> .DONE file, meaning that the EMS concludes that the step crashed. I can
>>> run the system again and it successfully reuses the results of the step
>>> (it doesn't have to rerun the step) but this is becoming frustrating as
>>> I have to restart the system
>>> frequently. I tried adding a call to sleep() in the check_if_crashed()
>>> method in experiment.perl but this is not helping in general - I think
>>> sometimes the delay is as much as a couple of minutes.
>>>
>>> Has anyone else faced this problem, or have a better idea for how to get
>>> around it?
>>>
>>> Cheers,
>>> Suzy
>>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>
>

-- 
Suzy Howlett
http://www.showlett.id.au/
