Thanks for the responses. I think I will go with the loop. I was a bit confused about this at first - it considers the step to have crashed if the STDERR file does not exist, but since the STDERR file is the output of the script that creates the DONE file, I would have thought that the DONE file could not be created without the STDERR file ultimately following. However, presumably if the STDERR file never appears for some reason, that is a problem in itself, and so should be treated as a crash.
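For anyone else following along, the wait-and-recheck loop Miles describes below could be sketched roughly as follows. This is just an illustration in Python, not the actual experiment.perl code; the function name, the file-prefix argument, and the timeout/interval values are all made up for the example:

```python
import os
import time

def wait_for_step_files(prefix, timeout=300, interval=5):
    """Poll until <prefix>.STDERR, <prefix>.STDOUT and <prefix>.DONE all
    exist, checking every `interval` seconds.

    Returns True if all three files appeared within `timeout` seconds,
    False otherwise (which the caller would then treat as a crash).
    The overall timeout is the "limit to prevent an endless loop".
    """
    suffixes = (".STDERR", ".STDOUT", ".DONE")
    deadline = time.time() + timeout
    while True:
        if all(os.path.exists(prefix + s) for s in suffixes):
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)  # give the network file system time to sync
```

The key point is that the DONE file alone is not trusted: the loop only declares success once all three files are visible, which absorbs the NFS delay between the job finishing and the STDERR/STDOUT files becoming visible on the master.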
The unfortunate thing about putting a loop like this in check_if_crashed is that it also has to go through this when it's planning what steps to do, which could lead to a long delay in planning if a step really has crashed without creating a STDERR file.

I think the problem is ultimately with our cluster. I noticed that some jobs would sometimes sit on the queue with status "exiting" for several minutes - so the DONE file had been created, but the STDERR file would not appear until after the job had finally been removed from the queue. Having given it some more thought, I think the issue may be with writing to disk. I'm pretty sure that the slave nodes do not have their own hard disks, only the master, and I think jobs may have stalled while they waited for a chance to write results to disk - the master node was very busy at the time. I don't know if that accounts for it! I'm not sure how the slaves having no hard disks interacts with Hieu's point - I don't really understand how the setup works.

Thanks again,
Suzy

On 2/09/10 8:26 PM, Miles Osborne wrote:
> a better setup would be to have a loop which did the following:
>
> --for a given version number and step, check for STDERR, STDOUT and DONE
> --if they are all found, exit
> --otherwise sleep and recheck
>
> (and put some limit overall to prevent an endless loop)
>
> Miles
>
> On 2 September 2010 11:16, Hieu Hoang <[email protected]> wrote:
>> sounds like a bad case of a network file system. you prob need to
>> harass your sysadmin and try a few of these too
>> http://fixunix.com/nfs/61890-forcing-nfs-sync.html
>>
>> On 02/09/2010 04:09, Suzy Howlett wrote:
>>> Hi everyone,
>>>
>>> I'm running Moses through its experiment management system across a
>>> cluster and I'm finding that sometimes jobs will finish successfully but
>>> the .STDERR and .STDOUT files will be slow in appearing relative to the
>>> .DONE file, meaning that the EMS concludes that the step crashed.
>>> I can run the system again and it successfully reuses the results of
>>> the step (it doesn't have to rerun the step), but this is becoming
>>> frustrating as I have to restart the system frequently. I tried adding
>>> a call to sleep() in the check_if_crashed() method in experiment.perl
>>> but this is not helping in general - I think sometimes the delay is as
>>> much as a couple of minutes.
>>>
>>> Has anyone else faced this problem, or have a better idea for how to
>>> get around it?
>>>
>>> Cheers,
>>> Suzy
>>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>

--
Suzy Howlett
http://www.showlett.id.au/
