Rainer, I don't mean to pick on you, but I see this probabilistic argument all too frequently, and I think it deserves a response. I'll readily concede that over the short term, while we are triaging bugs, a probabilistic argument such as the one you make below is perfectly reasonable. After all, on short time scales we can't solve all of the world's problems--our primary goal is to mitigate risk as best we can.
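As a side note on why failure-free experience is such weak evidence, here is a minimal statistical sketch in Python (the restart count below is purely hypothetical, not any real site's data): after n crash-restarts with zero observed corruption, an exact binomial calculation still leaves a surprisingly large upper bound on the per-restart failure probability--roughly 3/n at 95% confidence, the classical "rule of three".

```python
def failure_prob_upper_bound(n_failure_free, confidence=0.95):
    """Upper bound on the per-trial failure probability p after observing
    n failure-free trials: solve (1 - p)**n = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_failure_free)

# Hypothetical: 200 crash-restarts across a site, zero corruption seen.
# The 95% upper bound is still about 0.015 -- i.e. a 1.5% per-restart
# failure rate is entirely consistent with that spotless track record.
print(failure_prob_upper_bound(200))
```

And this bound says nothing about other sites: without comparable workloads (state space coverage), it does not transfer at all.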
However, I think the purpose of this thread is strategic: to determine what should be in the code several releases from now. In my opinion, over these longer time scales, probabilistic arguments about risk are not acceptable, as we have sufficient time to solve these problems in a manner that is both deterministic and provably correct. The fact that you haven't encountered any serious problems (regardless of the time span or number of machines) while running fast-restart or bitmap-later in your environment does absolutely nothing to disprove the existence of truly insidious failure modes. Furthermore, without any good means of comparing state space coverage, we can't even begin to infer a probability of failure at other sites from that experience.

Further comments inline...

On Fri, Jun 18, 2010 at 5:47 AM, Rainer Toebbicke <[email protected]> wrote:
> Jeffrey Hutzelman schrieb:
>>
>> Really, I consider enable-fast-restart to be extremely dangerous.
>> It should have gone away long ago.
>>
>> I realize some people believe that speed is more important than not
>> losing data, but I don't agree, and I don't think it's an appropriate
>> position for a filesystem to take. Not losing your data is pretty much
>> the defining difference between filesystems you can use and filesystems
>> from which you should run away screaming as fast as you can. I do not
>> want people to run away screaming from OpenAFS, at any speed.
>>
>
> I beg to disagree: the Volume/Vnode back-end has by no means the same
> problems that a file system might have. Damages there will never wildly
> destroy random items on disk, as you would have to fear in a file
> system. At least in namei, damages in a volume are entirely contained

All this implies is that each volume group is, in effect, its own little failure domain. Each of those failure domains is individually capable of being inconsistent following the crash.
Moreover, each is individually capable of further corrupting itself, should it come online without an internal consistency check. I suppose one conclusion you could draw from this is that, due to failure isolation, infrequently modified volume groups are less likely to become further corrupted than frequently modified volume groups...

> therein, files themselves are at the worst entirely replaced by others,
> they're never corrupted partly other than being half-written or such. Of

I beg to differ on this point: the fact that multiple vnodes may end up pointing to the same namei backing store in no way implies whole-file replacement. Partial corruption is an absolutely plausible failure mode. You can easily end up in either of these situations:

* a couple of chunks get flushed over top of what used to be (and still
  is) another vnode's backing store
* we create/delete a directory entry over top of what used to be another
  vnode's backing store

> course files on disk can become unfindable or directories can have bogus
> entries.
>

In general, you're making a probabilistic, and thus trivially disprovable, argument. While I'll readily concede that this sort of probabilistic argument is the foundation of risk analysis for most sites, I do not agree that OpenAFS, as a code vendor, should be in the business of supplying code whose correctness guarantees are wholly non-deterministic and probabilistic in nature. When you re-attach a volume following a crash without performing internal consistency checks, you're introducing non-determinism--because we don't have any form of journal--into the distributed system.

From the point of view of the administrator, I'll grant that a volume may just be a collection of vnodes. However, from the user's point of view, there is a complex semantic relationship between those files (and possibly to other wholly unrelated entities within the distributed system).
Breaking that structure in any way introduces non-determinism, whereby rolling back to a sync point becomes hard once you allow production operations to proceed from that point of inconsistency. Even if you're disciplined enough to schedule salvages for down periods within a reasonable time frame following the crash (thus attempting to mitigate Jeff's argument regarding metadata corruption going unnoticed for long enough that backups expire), you've still introduced non-determinism into the distributed system, thus permitting corrupting ripple effects.

Reverting once trouble is uncovered is an extremely painful process. Reconstructing exactly what should be restored in order to make the distributed system consistent again, as I discussed above, requires a deep understanding of the applications involved and the semantic relationships between them. Let's face it--most people just punt on this problem. While completely eliminating the non-determinism introduced by the crash is out of scope for this discussion, we can strive to minimize it by checking internal consistency before attempting to service any RPCs...

[snip]

> For us, the delta does not justify keeping the service down for several
> hours after a crash. Make that delta proportionally bigger by fixing the
> other issues and I revise my statement.
>

Ok. That's a perfectly fair rationale. What I still don't understand is why people think _OpenAFS_ should strive to ship (and thus implicitly endorse) code that introduces such non-determinism (especially given that, as Andrew pointed out, under DAFS enabling fast-restart semantics will quite literally involve a one-line out-of-tree, unsupported change)...

Regards,

-Tom

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
