>>>> [ ... whether apps can rely on the kernel always returning a
>>>> full read or write count on file IO except at EOF on read ... ]

As has been remarked, the answer is NO. BTW this question is not the
same as the "interruptible" one: there is a difference between the
kernel being allowed to read(2) or write(2) less than the process
requested and those calls being apparently atomic.

The kernel may always return a count of bytes read or written that is
less than requested, for any reason whatever, even if no signal has
interrupted the operation. Applications have to deal with it.

Most applications are written wrong (and in many other ways too: how
many check, or even just '(void)', the return code from 'close', never
mind calling 'flock' or 'fsync'?), and as many kernel writers say
"userspace sucks". But most application mistakes matter only in
infrequent cases, and when these happen users just shrug.

The reasons why the semantics are like that have been explained very
clearly by Gabriel in his paper "Worse is better".

>>> How about a network file system waiting for server failover
>>> (especially if it is not automatic)?

>> That's not indefinite. The FS is waiting for something which
>> will eventually occur.

Here "indefinite", as applied to a wait duration, is used in two rather
different ways. One is to say that it is "unknowable"; the other is
that it is "unknown at the moment and expected to be in some relevant
sense long".

There is a fundamental difference. If the outcome (success or failure)
of an operation may or may not become known, we have a completely
different class of models of computation from the usual Turing or
Church or Von Neumann one, with quite different properties. In that
class the halting problem does not exist, as all computations must
complete, but the outcome on completion can be indeterminate, which is
the opposite of the usual class of models of computation.

Once upon a time I even wrote a paper (in a very obscure journal) on
the difference between the two classes of models of computation and
why it matters a lot.

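To make the short read/write point at the top of this message concrete,
here is a minimal sketch in Python, whose 'os.read' and 'os.write' are
thin wrappers over read(2) and write(2). The helper names 'write_all',
'read_exact' and 'append_durably' are mine and purely illustrative; this
is one way of looping on the returned count, not the only one.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data to fd, looping over short writes."""
    view = memoryview(data)
    while view:
        # os.write may accept fewer bytes than offered, signal or no signal.
        # (Since Python 3.5 an EINTR is retried for us; short counts are not.)
        n = os.write(fd, view)
        if n == 0:
            raise OSError("write returned 0 bytes; giving up")
        view = view[n:]


def read_exact(fd: int, count: int) -> bytes:
    """Read count bytes from fd, returning fewer only if EOF is reached."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer bytes than asked
        if not chunk:                   # empty result means EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def append_durably(path: str, data: bytes) -> None:
    """Append data to path, then fsync and close, letting errors propagate."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)   # ask the kernel to push the data to stable storage
    finally:
        os.close(fd)   # raises OSError on failure, so close is not ignored
```

An application that calls 'os.write' once and ignores the returned count
is making exactly the mistake described above; it will simply not notice
until one of the infrequent cases turns up.
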
In the distributed filesystem case one is trying to simulate one class
of models of computation on another, which is simply not possible in
the edge cases (those which matter). Attempting POSIX semantics in that
case requires a lot of effort and a considerable suspension of
disbelief.

> (Assuming it is correctly administered).

That's the key statement: the hidden assumption here is that "correctly
administered" means that there is a central agency that ensures that
all operations have a known outcome if they complete.

If there is no central agency, all operations complete because they
eventually time out, but whether they succeeded or not is not always
knowable.

> That IS indefinite. Indefinite just means that the limit is
> vague and/or unknown, as opposed to having a clear and well
> defined bound.

Actually that applies only to the non-distributed case. In the
distributed case it means that the outcome may be absolutely
unknowable.

Suppose for example that you write a log entry to a file on a Lustre
file server, and the kernel code receives confirmation that the write
request has been sent, but then all communication with the file server
ceases. Has the log entry been written to the file server disk? Well,
how can you figure that out? No way (unless an admin looks at the file
server and thus restarts communications).

That's "indefinite": when whether the operation succeeded or failed
cannot be known.

[ ... ]

> If a job hangs on a write because servers are unavailable, it
> should hang indefinitely until the server is restored, or
> until a human interrupts the process with a signal.

That is the case if one wants to preserve the illusion that a model of
computation of the centralized class is available when one of the
distributed class is the reality. The human interruption is then the
point at which the illusion goes away.

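The log-entry example can be sketched as an operation with three
possible outcomes rather than two. This is only an illustration under
stated assumptions: 'Outcome' and 'await_ack' are hypothetical names,
and the 'queue.Queue' merely stands in for whatever channel carries the
server's reply. None of it is a Lustre interface.

```python
import enum
import queue


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"   # the server confirmed the write
    FAILED = "failed"         # the server reported an error
    UNKNOWN = "unknown"       # request sent, then silence


def await_ack(replies: "queue.Queue[bool]", timeout: float) -> Outcome:
    """Wait for the reply to a request that has already been sent.

    The operation always completes, because the wait times out, but a
    timeout says nothing about whether the write reached the disk.
    """
    try:
        ack = replies.get(timeout=timeout)
    except queue.Empty:
        return Outcome.UNKNOWN
    return Outcome.SUCCEEDED if ack else Outcome.FAILED
```

A caller written for the centralized class has nowhere to put the third
value: it must either keep waiting or pretend that 'UNKNOWN' is
'FAILED', and in the second case it may find the log entry written
anyway.
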
The better way to handle the distributed case is to design programs
knowingly for the class of distributed models of computation, which
requires completely different programming strategies. Of course nearly
everybody does not realize that (even if some programmers of
distributed systems with high reality requirements rediscover them).

> Only a human can determine how valuable that job's output is.
> If the human determines that the data is very important, and
> must finish, then they leave the job hanging and go fix the
> servers. If they decide that the data is easily reproducible
> and the compute cluster is better spent running another job
> that doesn't require the down filesystem, then they have the
> ability to abort the operation with a signal. [ ... ]

Here though you are assuming that the underlying model of computation
is in the centralized class, that is, that the outcome (success or
failure) of an operation is knowable, even if only by the human
component of the system.

For Lustre systems this is usually a good assumption, as most Lustre
installations are centrally managed and in a single location, and
hardware and software state can and will be inspected to determine the
outcome of operations.

But people are now using Lustre across wide geographical networks and
over mobs of thousands (or tens of thousands) of clients and servers,
and in such cases it is usually not practical to assume that the
outcome of an operation is knowable. Eventually people will learn that
this means a completely different world. Or not, as the 'O_PONIES'
story about 'fsync' and barriers demonstrates.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
