On 16 Jan 2010, at 19:28, Adam Megacz wrote:
>
> Derrick Brashear <[email protected]> writes:
>> Instead, give it some arguments, like -i
>> /Library/Logs/DiagnosticReports/(path to kernel report).
>
> Ok,
Thanks.
So, as I said earlier, this is the bug that Derrick tracked down, and I fixed
earlier today. I thought it might be of use to say how we went about doing
this, given the information in the following ...
>
> Panic Date: Interval Since Last Panic Report: 472905 sec
> Kernel Version: Darwin Kernel Version 10.2.0: Tue Nov 3 10:37:10 PST
> 2009; root:xnu-1486.2.11~1/RELEASE_I386
> OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
> =============
> add symbol table from file
> "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 0x21b2bd
> <panic+445>: mov 0x8011d0,%eax
> 0x2a7ac2 <kernel_trap+1530>: jmp 0x2a7ade <kernel_trap+1558>
> 0x29d968 <lo_alltraps+712>: mov %edi,%esp
> 0x4607e500 <afs_GetDCache+7832>: mov 0x64(%edx),%ebx
So, this is the point where the panic actually occurred (all of the stuff above
is related to the trap mechanism). If we look at your earlier post - the raw
panic log, we see:
CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS: 0x00000004, DS: 0x0000000c
So, at the time of the panic, EDX was probably NULL. So, the bug is a NULL
pointer dereference of some kind, and we're looking up something that's 0x64
bytes into the structure that's pointed at by EDX.
If we had a kernel with debugging symbols, then we could look directly at the C
source which corresponds to afs_GetDCache+7832. Unfortunately, we didn't have
one to hand, so Derrick improvised. By disassembling the current kernel module,
we can see exactly what code occurs at that location...
gdb --arch=i386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>: mov 0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>: call 0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>: cmp 0x0,%eax
0x0001150e <afs_GetDCache+7846>: je 0x1152c <afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>: movl $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>: movl $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>: movl $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>: call 0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>: movl $0x0,0x0
0x00011536 <afs_GetDCache+7886>: mov 0x0,%eax
0x0001153b <afs_GetDCache+7891>: mov %eax,(%esp)
0x0001153e <afs_GetDCache+7894>: call 0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>: mov %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>: mov -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>: mov %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>: call 0x4df78 <rx_EndCall>
[ ... ]
(The call 0x0 is because we're debugging a module that isn't loaded. You can
get a dump that fills in the blanks for these by doing the following. When
kextutil asks for the load address, give it the one that the panic log said the
OpenAFS module was loaded at, in this case 0x4606c000
mkdir /tmp/symbols
kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
gdb --arch=i386 /mach_kernel
(gdb) add_kext /tmp/symbols/afs.kext
(gdb) disassemble afs_GetDCache
)
We're relatively lucky here, as near the panic location is a call to
rx_EndCall(), the last of 2 that occur in that GetDCache. Lining this up with
the corresponding source code, the last call to rx_EndCall occurs at line 2219
of afs_dcache.c, which looks something like:
if (length > size) {
/* The fileserver told us it is going to send more data
* than we requested. It shouldn't do that, and
* accepting that much data can make us take up more
* cache space than we're supposed to, so error. */
code = rx_Error(tcall);
RX_AFS_GUNLOCK();
code1 = rx_EndCall(tcall, code);
RX_AFS_GLOCK();
tcall = (struct rx_call *)0;
code = EIO;
}
So, assuming that the compiler hasn't reordered our code, we've got rx_Error(),
followed by RX_GUNLOCK. It isn't possible for the compiler to reorder over a
lock, so it's pretty likely that the ordering is as shown. Both of these are
macros masquerading as functions. Looking at the simpler, rx_Error(), we have:
#define rx_Error(call) ((call)->error)
Now, if you recall, we panic'd because we were looking up something that has an
offset of
0x64 from the base of the structure. We can verify that 'error' is stored 0x64
bytes into the rx_call structure by visually examining the structure definition
or, if you've got a build tree from the same architecture, by running:
gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 = (afs_int32 *) 0x64
All of this points pretty conclusively at a NULL value for 'tcall'. Examining
the code, there is a situation where that can occur, if we get a bogus length
from the network, and as a result double free the rx call. It seems highly
likely that this was your bug.
Hopefully all of that is vaguely of use to folk wanting to know how these
things fit together - and a huge thanks to Derrick for all of the information
about how he did the disassembly steps.
Cheers,
Simon.
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info