Jeffrey Hutzelman wrote:


Really, I consider enable-fast-restart to be extremely dangerous.
It should have gone away long ago.

I realize some people believe that speed is more important than not losing data, but I don't agree, and I don't think it's an appropriate position for a filesystem to take. Not losing your data is pretty much the defining difference between filesystems you can use and filesystems from which you should run away screaming as fast as you can. I do not want people to run away screaming from OpenAFS, at any speed.


I beg to disagree: the Volume/Vnode back-end by no means has the same problems that a file system might have. Damage there will never wildly destroy random items on disk, as you would have to fear in a file system. At least in namei, damage in a volume is entirely contained within it; files are at worst replaced entirely by others, and they are never partly corrupted other than by being half-written or the like. Of course, files on disk can become unfindable, and directories can have bogus entries.

My experience is that damage to the vnode files usually results in directories containing inaccessible entries and, on very rare occasions, cross-linked files. The link table is surprisingly robust (even with its header overwritten).

I reckon that in over 15 years of AFS service we have probably had more bit errors in files due to uncaught memory errors and uncaught transmission errors, not to mention the major culprit, "programming errors", than nasty inconsistencies after crashes that complete and immediate salvaging would have caught.

We salvage volumes in the background at a low rate, and on file servers that never crash the logs show the same odd issues as on those that have crashed; hence the added risk of running with potential damage is within the error bars. Even between salvages, volumes drop out every now and then. The practical approach is to detect this quickly and re-salvage and, when the rate exceeds the pain threshold, to find the bug and fix it.

For us, the delta does not justify keeping the service down for several hours after a crash. Make that delta proportionally bigger by fixing the other issues and I will revise my statement.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155
