On Thu, 12 Apr 2007, Derrick J Brashear wrote:

On Wed, 11 Apr 2007, Stephan Wiesand wrote:

One of our systems panicked twice within two hours yesterday, at the same location in the OpenAFS client. I attached the kernel's last words below.
[...]
I'm thinking about a patch. I have something else I need to deal with first, but I will try to work something up afterwards. There's a third possibility, namely that the missing object was mishashed. Instead of panicking, we can presumably just iterate over everything and dump state.

I suppose the other possibility would be to get a kernel crash dump, but it's sort of cumbersome to move those around, so unless you're comfortable using a debugger on a kernel dump, that's probably a non-starter.
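
The "iterate everything and dump state" idea mentioned above could look roughly like the following minimal sketch in C. It is not the OpenAFS code and does not reproduce what afs_HashOutDCache does. The table layout, the modulo hash, and the names (struct entry, struct table, hash_key, hash_out) are all hypothetical and chosen only for illustration. The point is the fallback: when an entry is missing from the bucket its key hashes to, scan every bucket and report where it was found instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NBUCKETS 8              /* hypothetical table size */

struct entry {
    unsigned key;
    struct entry *next;         /* chain within one bucket */
};

struct table {
    struct entry *bucket[NBUCKETS];
};

/* Hypothetical hash function: plain modulo on the key. */
static unsigned hash_key(unsigned key)
{
    return key % NBUCKETS;
}

/* Unlink e from bucket b. Returns 1 if it was found there, 0 otherwise. */
static int unlink_from_bucket(struct table *t, unsigned b, struct entry *e)
{
    struct entry **pp;

    for (pp = &t->bucket[b]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;
            e->next = NULL;
            return 1;
        }
    }
    return 0;
}

/*
 * Remove e from the table. Returns the bucket it was removed from,
 * or -1 if it is in no bucket at all. A miss in the expected bucket
 * triggers a full scan and a diagnostic rather than a panic.
 */
static int hash_out(struct table *t, struct entry *e)
{
    unsigned expected, b;

    assert(t != NULL && e != NULL);
    expected = hash_key(e->key);
    if (unlink_from_bucket(t, expected, e))
        return (int)expected;

    fprintf(stderr, "key %u missing from bucket %u, scanning all buckets\n",
            e->key, expected);
    for (b = 0; b < NBUCKETS; b++) {
        if (b != expected && unlink_from_bucket(t, b, e)) {
            fprintf(stderr, "key %u was mishashed into bucket %u\n",
                    e->key, b);
            return (int)b;
        }
    }
    fprintf(stderr, "key %u is not in any bucket\n", e->key);
    return -1;
}
```

A caller would treat a return value different from hash_key(e->key) as evidence of mishashing, and -1 as evidence that the object was never inserted or was already removed.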

Got one:

# crash /boot/vmlinux-2.4.21-47.0.1.ELsmp vmcore
crash 4.0-2.29
[...]
crash> bt
PID: 1002   TASK: f49f2000  CPU: 2   COMMAND: "afs_cachetrim"
 #0 [f49f3cc8] netconsole_netdump at f8a1d793
 #1 [f49f3cdc] try_crashdump at c0129033
 #2 [f49f3cec] die at c010c6f2
 #3 [f49f3d00] do_page_fault at c0120389
 #4 [f49f3dc4] error_code (via page_fault) at c02b01c0
    EAX: 00000009  EBX: f8b5a000  ECX: 00000046  EDX: c0388e98  EBP: 00000002
    DS:  0068      ESI: f8c2dfa0  ES:  0068      EDI: 0005867a
    CS:  0060      EIP: f8a6da50  ERR: ffffffff  EFLAGS: 00010282
 #5 [f49f3e00] osi_Panic at f8a6da50
 #6 [f49f3e20] afs_HashOutDCache at f8a2d9ea
 #7 [f49f3e40] afs_GetDownD at f8a2d6a3
 #8 [f49f3fa0] afs_CacheTruncateDaemon at f8a2cd29
 #9 [f49f3fe0] afsd_thread at f8a7f9eb
#10 [f49f3ff0] kernel_thread_helper at c01095cb
crash>

Alas, I'm afraid this is the point where I'll need either some guidance or a lot of reading and experimenting to get any further.

NB:

During my previous attempt to make this happen, I got no panic but lots of messages about the cache [partition] being full, and that I should reduce the cache. However, the dedicated ext3 filesystem was neither full nor out of inodes, and I think the cachesize setting (70% of what's left of the filesystem after subtracting 32MB for the journal) is rather conservative.
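
For reference, the cachesize rule described above (70% of the filesystem after subtracting 32MB for the journal) amounts to the small calculation sketched below. The function name cache_size_kb and the choice of 1 KB units are assumptions made for this illustration, and the actual filesystem size on the affected machine is not stated in this message.

```c
#include <assert.h>

/*
 * Cache size in KB: `percent` of what remains of the filesystem
 * after the journal is subtracted. All sizes are in KB.
 * Hypothetical helper, not part of OpenAFS.
 */
static unsigned long cache_size_kb(unsigned long fs_kb,
                                   unsigned long journal_kb,
                                   unsigned percent)
{
    assert(percent <= 100);
    if (fs_kb <= journal_kb)
        return 0;               /* nothing left for the cache */
    /* Widen before multiplying so large filesystems do not overflow. */
    return (unsigned long)((unsigned long long)(fs_kb - journal_kb)
                           * percent / 100);
}
```

With a purely hypothetical 1 GB filesystem (1048576 KB) and a 32 MB journal (32768 KB), cache_size_kb(1048576, 32768, 70) gives 711065 KB.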

When I tried to restart the client, I experienced what I've seen frequently with 1.4.x clients on this platform: "kernel BUG at slab.c:892:" when re-inserting the openafs module. This seems to happen quite consistently when restarting the client after it has run for some time (say, a week).

I have a crashdump from this incident as well. After a reboot, it took less than three hours to get the above panic.

I don't think it's a hardware problem, but if it helps I'd be willing to try and reproduce this on another system.

- Stephan

--
Stephan Wiesand
  DESY - DV -
  Platanenallee 6
  15738 Zeuthen, Germany

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
