On Wed, Oct/17/2007 07:45:53AM, Jeff Squyres wrote:
> On Oct 16, 2007, at 6:36 PM, Ethan Mallove wrote:
>
> >>> The bail is that "make" will eventually succeed or fail
> >>> with something other than "interrupted system call". Do
> >>> we need another condition?
> >>
> >> I'm just worried that Sun's NFS will somehow get in a
> >> perpetual "interrupted system call" loop such that you'll
> >> never actually break out of it.
> >
> > How about a counter? E.g., after "x" number of "interrupted
> > system call" messages, MTT moves on. Or should the "command
> > restart" go in DoCommand.pm so we can have a timeout?
>
> Either or both of those would be fine (don't we have a timeout in
> DoCommand.pm already?).

There is a timeout in DoCommand, but currently I keep reinvoking
DoCommand on each "interrupted system call" so the timeout gets
reset each time. This would not be the case if the do-while were
to go in DoCommand. Then again, an infinite loop is certain in the
case of a command that is *expected* to output "interrupted system
call".

-Ethan

>
> > I also noticed that our build script (which prints hundreds
> > of "interrupted system call" messages per build, but does
> > not seem to die because of them) uses "make -j 24" while MTT
> > has been using "make -j 4". I'll experiment with -j.
>
> I know that Terry/Sun and co. spent a good amount of time trying to
> solve the "interrupted system call" error -- they may have some more
> information for you, such as how/why it happens...?
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
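
For illustration only, here is a rough sketch of how the counter and a
single non-resetting timeout discussed in this thread could be combined.
It is written in Python rather than MTT's Perl and is not DoCommand.pm
code. The names run_once, run_with_retries, max_restarts and
total_timeout are all hypothetical, and the rule "restart only when the
command exits non-zero" is an assumption made so that a command which
merely prints the message does not loop.

```python
import subprocess
import time

# Text searched for in the command output (compared in lower case).
MARKER = "interrupted system call"


def run_once(cmd, timeout):
    """Run cmd once and return (exit_status, combined stdout+stderr).

    subprocess.run raises subprocess.TimeoutExpired if the timeout passes.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def run_with_retries(cmd, max_restarts=5, total_timeout=3600.0, runner=run_once):
    """Restart cmd while it fails with MARKER, bounded in two ways.

    max_restarts  -- hypothetical cap on restarts (the "x" counter)
    total_timeout -- one budget in seconds shared by all attempts, so a
                     restart does not reset the clock
    """
    deadline = time.monotonic() + total_timeout
    restarts = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("gave up: overall timeout exceeded")
        # Each attempt only gets whatever is left of the shared budget.
        status, output = runner(cmd, remaining)
        interrupted = MARKER in output.lower()
        # Assumption: only a failed run is restarted, so a command that
        # prints the marker but exits 0 is returned as-is.
        if status == 0 or not interrupted or restarts >= max_restarts:
            return status, output, restarts
        restarts += 1
```

With something like this, run_with_retries(["make", "-j", "4"]) would
give up after five restarts or one hour in total, whichever comes first,
and return the last exit status and output to the caller.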