On Mon, 21 Oct 2013 15:18:06 +0100 Stephen Quinney <[email protected]> wrote:
> Has anyone else seen a kernel panic like this on EL6 with 1.6.5 and
> kernel 2.6.32-358.14.1.el6? Or does anyone have any suggestions as to
> what might have caused the problem?
>
> afs: disk cache read error in CacheItems slot 353815 off 28305220/36284420
> code -4/80
> openafs: assertion failed: tdc, file:

In short: this means we got an error when reading from the cache fs. I
assume that -4 is -EINTR, so that means we probably need to block
signals on Linux when reading from the cache fs. (Or support getting
interrupted by signals, but we don't do that now.)

Some more info:

Historically, the unix client hasn't really handled cache i/o errors at
all. In various places a failed read or write from/to the cache would
panic the machine, or error to userspace, or corrupt cache accounting
information, etc etc. This is improving over time, and it's much better
now than it has been in the past, but not all instances have been fixed
yet.

That particular error you saw occurs when reading a dcache slot from
disk. In the past, an error like this would corrupt cache information,
since the old code assumed that errors like this never happened. We've
suspected that this is what's causing some other reports of crashes and
small cache corruption involving dslot hash chain corruption; we didn't
really _know_, since in those cases, the "problem" happened at some
point in the past by the time the crash occurred. This dslot read error
has been a candidate for the cause of those issues (and I believe,
pretty much the only candidate that hasn't been otherwise ruled out).

So we added a log message when that error occurs, and made the relevant
function return an error. Various code paths were adjusted to try to
handle the error as gracefully as possible, but some were more
complex/difficult than others.
A few, such as the specific backtrace you mentioned, have not hit 1.6
yet, since I was concerned about introducing new/different errors in the
error handling since we didn't really know what was going on in these
scenarios.

Anyway, you're the first person I've seen actually hit this since the
more informative log messages were introduced, so hooray! Now we finally
know what's going on (in one scenario, at least).

Developer references:

Noticing the disk error was added in gerrit 7940, though many other
commits have been changing it and fixing issues. The "easy"
error-handling cases mentioned above are in 7941 and 9287. Some of the
more "hard" cases are 8376, 8377, 8405, 8406.

-- 
Andrew Deason
[email protected]
