On 07/08/2010 04:51 PM, John Hammond wrote:
>> How about a network file system waiting for server failover
>> (especially if it is not automatic)?
>
> That's not indefinite. The FS is waiting for something which will
> eventually occur. (Assuming it is correctly administered).

That IS indefinite. Indefinite just means that the limit is vague and/or unknown, as opposed to having a clear and well-defined bound.

In the IO context, any operation that is unbounded (indefinite) may take a very long time in human terms, and therefore should be interruptible. It is just not reasonable to have a process stuck unkillable for days. On the other hand, we don't want it timing out either if the data is valuable and we are willing to wait a day or two for hardware repairs.

Even ignoring the fact that Lustre's behavior is allowed by the POSIX spec, I believe that Lustre is doing the Right Thing. If a job hangs on a write because servers are unavailable, it should hang indefinitely until the server is restored, or until a human interrupts the process with a signal.

Only a human can determine how valuable that job's output is. If the human determines that the data is very important and the job must finish, then they leave the job hanging and go fix the servers. If they decide that the data is easily reproducible and the compute cluster is better spent running another job that doesn't require the down filesystem, then they have the ability to abort the operation with a signal.

Now, all that said, there may be an argument to be made that SIGSTOP and SIGCONT should not be signals that interrupt Lustre client operations.
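
As a rough illustration of "block with no timeout, but stay interruptible by a signal", here is a minimal Python sketch. It is not Lustre code. A pipe with no writer activity stands in for the unavailable server, SIGALRM stands in for the signal a human would send, and the names Interrupted, _on_signal, read_until_data_or_signal and interrupt_after are all made up for the example. It assumes a Unix system and must run in the main thread.

```python
import os
import signal


class Interrupted(Exception):
    """Raised by the signal handler to abandon a blocked I/O call."""


def _on_signal(signum, frame):
    # Raising here is what makes Python give up on the blocked call;
    # without an exception, Python retries the system call after EINTR.
    raise Interrupted(signum)


def read_until_data_or_signal(fd, interrupt_after=None):
    """Block on fd with no timeout of its own.

    Returns the bytes read, or None if a signal interrupted the wait.
    interrupt_after (seconds) exists only for the demo: it schedules a
    SIGALRM to play the role of the human sending a signal.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_signal)
    try:
        if interrupt_after is not None:
            signal.setitimer(signal.ITIMER_REAL, interrupt_after)
        return os.read(fd, 4096)  # waits as long as it takes
    except Interrupted:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

With r, w = os.pipe(), calling read_until_data_or_signal(r, interrupt_after=0.1) waits until the signal arrives and then gives up, while calling it with no interrupt_after after data has been written to w returns that data. The wait itself never times out; only the signal ends it early.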
