On Thu, 12 Apr 2007, Stephan Wiesand wrote:

On Wed, 11 Apr 2007, Derrick J Brashear wrote:

On Wed, 11 Apr 2007, Stephan Wiesand wrote:

One of our systems panicked twice within 2 hours yesterday, at the same location in the OpenAFS client. I attached the kernel's last words below.

This is an SL3 system, kernel 2.4.21-47.0.1.ELsmp, i686. The client build has two patches on top of 1.4.4: linux-task-pointer-safety-20070320 from CVS, and the one from
https://lists.openafs.org/pipermail/openafs-devel/2007-March/014985.html
[]
So basically you appear to have an unhashed dcache entry. Either there's a locking bug or something is becoming erroneously unhashed.

How reproducible is it?

Good news: it is reproducible. The user confessed that he'd run "less than 20" parallel rsyncs transferring data to our cell. The files are a mixture of data and log files, with typical sizes of 15MB and 100kB.

So I set up a dozen rsyncs to copy this data into another volume, and after some 9 hours got the panic you find below.
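
For anyone who wants to try the same reproduction, a minimal sketch of a driver that starts a fixed number of rsyncs at once might look like the following. It is not the setup actually used here: the function names, the paths, and the one-subdirectory-per-job layout are assumptions for illustration, and the destination root is assumed to exist already.

```python
import subprocess


def build_rsync_commands(source, dest_root, jobs):
    """Return one rsync argv list per job.

    Every job copies the same source tree into its own subdirectory of
    dest_root, so all jobs write concurrently. dest_root itself must
    already exist; rsync creates only the final jobN directory.
    """
    commands = []
    for i in range(jobs):
        dest = f"{dest_root}/job{i}/"
        commands.append(["rsync", "-a", source, dest])
    return commands


def run_parallel(commands):
    """Start all commands at once, then wait for each; return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Hypothetical usage (paths are placeholders, not the ones from this report):
#   cmds = build_rsync_commands("/data/src/", "/afs/example.org/test", 12)
#   print(run_parallel(cmds))
```

Raising the job count, or pointing the source at faster storage, are the obvious knobs for trying to trigger the panic sooner.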

I'm going to repeat this exercise now, and will also try to make the panic happen earlier (more rsyncs, read data from a faster source - any other ideas?).

Just wondering what to do next then.

I'm thinking about a patch. I have something else I need to deal with first, but I will try to work something up afterwards. There's a third possibility, namely that the missing object is mishashed. Instead of panicking, we can presumably just iterate over everything and dump state.
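
To illustrate the idea of scanning and dumping state rather than panicking, here is a toy model in Python. It is not OpenAFS code and not the patch discussed here: the table layout, the modulo hash, and the names diagnose and dump_state are all hypothetical. It only shows how a full scan can tell the three cases apart (entry unhashed, entry mishashed, entry gone entirely).

```python
def diagnose(key, buckets, all_entries):
    """Classify a lookup of key instead of aborting on a miss.

    buckets is a list of chains (lists of keys); all_entries is the set
    of every key that is supposed to exist. The hash is a plain modulo,
    which is an assumption of this toy model.
    """
    expected = key % len(buckets)
    if key in buckets[expected]:
        return ("ok", expected)
    # Miss in the expected chain: scan every chain before giving up.
    for idx, chain in enumerate(buckets):
        if key in chain:
            return ("mishashed", idx)
    if key in all_entries:
        return ("unhashed", None)
    return ("absent", None)


def dump_state(buckets):
    """Return one line per bucket describing its chain."""
    return [f"bucket {idx}: {chain}" for idx, chain in enumerate(buckets)]
```

In this model, a "mishashed" result would point at the hash computation or at an insert under the wrong key, while "unhashed" would point at a removal that raced with the lookup.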

I suppose the other possibility would be to get a kernel crash dump, but those are cumbersome to move around, so unless you're comfortable using a debugger on a kernel dump, that's probably a non-starter.

Derrick
_______________________________________________
OpenAFS-info mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo/openafs-info
