On 16 Jan 2010, at 19:28, Adam Megacz wrote:

> 
> Derrick Brashear <[email protected]> writes:
>> Instead, give it some arguments, like -i
>> /Library/Logs/DiagnosticReports/(path to kernel report).
> 
> Ok,

Thanks.

So, as I said earlier, this is the bug that Derrick tracked down, and I fixed 
earlier today. I thought it might be of use to say how we went about doing 
this, given the information in the following ...

> 
> Panic Date:      Interval Since Last Panic Report:  472905 sec
> Kernel Version:  Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST
> 2009; root:xnu-1486.2.11~1/RELEASE_I386
> OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
> =============
> add symbol table from file
> "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 0x21b2bd
> <panic+445>: mov    0x8011d0,%eax
> 0x2a7ac2 <kernel_trap+1530>:    jmp    0x2a7ade <kernel_trap+1558>
> 0x29d968 <lo_alltraps+712>:     mov    %edi,%esp
> 0x4607e500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx

So, this is the point where the panic actually occurred (all of the stuff above 
is related to the trap mechanism). If we look at your earlier post - the raw 
panic log, we see:

CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS:  0x00000004, DS:  0x0000000c

So, at the time of the panic, EDX was probably NULL. So, the bug is a NULL 
pointer dereference of some kind, and we're looking up something that's 0x64 
bytes into the structure that's pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly at the C 
source which corresponds to afs_GetDCache+7832. Unfortunately, we didn't have 
one to hand, so Derrick improvised. By disassembling the current kernel module, 
we can see exactly what code occurs at that location...

gdb --arch=i386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>:        call   0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>:        cmp    0x0,%eax
0x0001150e <afs_GetDCache+7846>:        je     0x1152c <afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>:        movl   $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>:        movl   $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>:        movl   $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>:        call   0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>:        movl   $0x0,0x0
0x00011536 <afs_GetDCache+7886>:        mov    0x0,%eax
0x0001153b <afs_GetDCache+7891>:        mov    %eax,(%esp)
0x0001153e <afs_GetDCache+7894>:        call   0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>:        mov    %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>:        mov    -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>:        mov    %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>:        call   0x4df78 <rx_EndCall>
[ ... ]

(The call 0x0 is because we're debugging a module that isn't loaded. You can 
get a dump that fills in the blanks for these by doing the following. When 
kextutil asks for the load address, give it the one that the panic log said the 
OpenAFS module was loaded at, in this case 0x4606c000

   mkdir /tmp/symbols
   kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
   cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
   gdb --arch=i386 /mach_kernel
   (gdb) add_kext /tmp/symbols/afs.kext
   (gdb) disassemble afs_GetDCache
)

We're relatively lucky here, as near the panic location is a call to 
rx_EndCall(), the last of 2 that occur in that GetDCache. Lining this up with 
the corresponding source code, the last call to rx_EndCall occurs at line 2219 
of afs_dcache.c, which looks something like:

    if (length > size) {
        /* The fileserver told us it is going to send more data
         * than we requested. It shouldn't do that, and
         * accepting that much data can make us take up more
         * cache space than we're supposed to, so error. */
        code = rx_Error(tcall);
        RX_AFS_GUNLOCK();
        code1 = rx_EndCall(tcall, code);
        RX_AFS_GLOCK();
        tcall = (struct rx_call *)0;
        code = EIO;
    }

So, assuming that the compiler hasn't reordered our code, we've got rx_Error(), 
followed by RX_GUNLOCK. It isn't possible for the compiler to reorder over a 
lock, so it's pretty likely that the ordering is as shown. Both of these are 
macros masquerading as functions. Looking at the simpler, rx_Error(), we have:

    #define rx_Error(call)                  ((call)->error)

Now, if you recall, we panic'd because we were looking up something that has an 
offset of
0x64 from the base of the structure. We can verify that 'error' is stored 0x64 
bytes into the rx_call structure by visually examining the structure definition 
or, if you've got a build tree from the same architecture, by running:

gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 = (afs_int32 *) 0x64

All of this points pretty conclusively at a NULL value for 'tcall'. Examining 
the code, there is a situation where that can occur, if we get a bogus length 
from the network, and as a result double free the rx call. It seems highly 
likely that this was your bug.

Hopefully all of that is vaguely of use to folk wanting to know how these 
things fit together - and a huge thanks to Derrick for all of the information 
about how he did the disassembly steps.

Cheers,

Simon.

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info

Reply via email to