Lawrence Stewart <[email protected]> wrote: > On 06/20/10 21:15, Fabian Keil wrote: > > Lawrence Stewart<[email protected]> wrote: > > > >> On 06/20/10 03:58, Fabian Keil wrote: > >>> Lawrence Stewart<[email protected]> wrote: > >>> > >>>> On 06/13/10 18:12, Lawrence Stewart wrote: > >>> > >>>>> The time has come to solicit some external testing for my SIFTR tool. > >>>>> I'm hoping to commit it within a week or so unless problems are > >>>>> discovered. > >>> > >>>>> I'm interested in all feedback and reports of success/failure, along > >>>>> with details of the architecture tested and number of CPUs if you would > >>>>> be so kind. > >>> > >>> I got the following hand-transcribed panic maybe a second after > >>> sysctl net.inet.siftr.enabled=1 > >>> > >>> Fatal trap 12: page fault while in kernel mode > >>> cpuid = 1; apic id = 01 > >>> [...] > >>> current process = 12 (swi4: clock) > >>> [ thread pid 12 tid 100006 ] > >>> Stopped at siftr_chkpkt+0xd0: addq $0x1,0x8(%r14) > >>> db> where > >>> Tracing pid 12 tid 100006 td 0xffffff00034037e0 > >>> siftr_chkpt() at siftr_chkpkt+0xd0 > >>> pfil_run_hooks() at pfil_run_hooks+0xb4 > >>> ip_output() at ip_output+0x382 > >>> tcp_output() tcp_output+0xa41 > >>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251 > >>> softclock() at softclock+0x291 > >>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66 > >>> ithread_loop at ithread_loop+0x8e > >>> fork_exit() at fork_exit+0x112 > >>> fork_trampoline() at fork_trampoline+0xe > >>> --- trap 0, rip = 0, rsp = 0xffffff800003ad30, rbp = 0 --- > >> > >> So I've tracked down the line of code where the page fault is occurring: > >> > >> if (dir == PFIL_IN) > >> ss->n_in++; > >> else > >> ss->n_out++; > >> > >> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats > >> per-cpu and is initialised at the start of the function like so: > >> > >> ss = DPCPU_PTR(ss); > >> > >> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on your > >> machine. I know very little about the inner workings of the DPCPU_* > >> macros, but I'm pretty sure the way I use them in SIFTR is correct or at > >> least as intended. > > > > siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing > > it itself. I think if ss was NULL, the panic should already occur in > > siftr_chkreinject(). > > Yes but siftr_chkreinject() only dereferences ss in the exceptional case > of a malloc failure or duplicate pkt. It's unlikely either case happens > for you and so wouldn't trigger the panic. > > > To be sure I added: > > > > diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c > > index 8bc3498..b9fdfe4 100644 > > --- a/sys/netinet/siftr.c > > +++ b/sys/netinet/siftr.c > > @@ -788,6 +788,16 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet > > *ifp, int dir, > > if (siftr_chkreinject(*m, dir, ss)) > > goto ret; > > > > + if (ss == NULL) { > > + printf("ss is NULL"); > > + ss = DPCPU_PTR(ss); > > + if (ss == NULL) { > > + printf("ss is still NULL"); > > + goto ret; > > + } > > + } > > + > > + > > if (dir == PFIL_IN) > > ss->n_in++; > > else > > > > which doesn't seem to affect the problem. > > As in it still panics and the "ss is NULL" message is not printed? I > would have expected to at least see "ss is NULL" printed if my > hypothesis was correct... hmm.
Yes, it still panics, but no message is printed.
> Perhaps the way I discovered the line number at which the panic occurred
> was wrong. I compiled SIFTR on my amd64 dev server with "CFLAGS+=-g" in
> the SIFTR Makefile to get debug symbols, ran "objdump -Sd siftr.ko | vim
> -", searched for the instruction reported in the panic message i.e.
> "addq $0x1,0x8(%r14)" and then with a bit of trial and error, recompiled
> SIFTR with the line of code "volatile int blah = 0; blah = 2;" at
> various points in the function and looking at the change in the objdump
> output to pinpoint which line of C code corresponded with the "addq"
> instruction.
>
> The "volatile int blah = 0; blah = 2;" compiles to "movl
> $0x0,0xffffffffffffffd4(%rbp)" followed immediately by "movl
> $0x2,0xffffffffffffffd4(%rbp)". When I put that code above the "if (dir
> == PFIL_IN)" statement I see the objdump output show the assembly code
> before the "addq" instruction and when I move it after the if statement
> the assembly code moves after the "addq" instruction.
That's a neat trick.
> Perhaps you could reproduce the above procedure and see if you identify
> the same point in the siftr_chkpkt function I did for the instruction
> referenced by the panic message?
I do. Using:
diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
index b9fdfe4..fc6bd9a 100644
--- a/sys/netinet/siftr.c
+++ b/sys/netinet/siftr.c
@@ -797,12 +797,15 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet
*ifp, int dir,
}
}
+ volatile int blah = 0; blah = 2;
if (dir == PFIL_IN)
ss->n_in++;
else
ss->n_out++;
+ volatile int foo = 0; foo = 3;
+
/*
* Create a tcphdr struct starting at the correct offset
* in the IP packet. ip->ip_hl gives the ip header length
I get:
803: c7 45 d4 00 00 00 00 movl $0x0,0xffffffffffffffd4(%rbp)
80a: c7 45 d4 02 00 00 00 movl $0x2,0xffffffffffffffd4(%rbp)
811: 0f 84 8a 02 00 00 je aa1 <siftr_chkpkt+0x371>
817: 49 83 46 08 01 addq $0x1,0x8(%r14)
81c: c7 45 d0 00 00 00 00 movl $0x0,0xffffffffffffffd0(%rbp)
823: c7 45 d0 03 00 00 00 movl $0x3,0xffffffffffffffd0(%rbp)
> >> Could you please go ahead and retest using a GENERIC kernel and see if
> >> you can reproduce? There could be something in your custom kernel
> >> causing the offsets or linker set magic used by the DPCPU bits to break
> >> which in turn is triggering this panic in SIFTR.
> >
> > I'll retry without pf first, and with GENERIC afterwards.
>
> Sounds good, thanks.
Taking pf (and altq) out of the picture doesn't seem to make
a difference.
Fabian
signature.asc
Description: PGP signature
