On Thu, Mar 24, 2016 at 05:21:00PM +0100, Frederic URBAN wrote:
panic: kernel diagnostic assertion "sotoinpcb(inp->inp_socket) == inp"
failed: file "../../../../netinet/tcp_input.c", line 632
Stopped at      Debugger+0x9:   leave
    TID    PID    UID    PRFLAGS    PFLAGS  CPU  COMMAND
  40563  40563    515       0x32      0x80    2  squid
*  1402   1402      0    0x14000     0x210    4  softnet
Debugger() at Debugger+0x9
panic() at panic+0xfe
__assert() at __assert+0x25
tcp_input() at tcp_input+0x122c
ipv4_input() at ipv4_input+0x32e
ipintr() at ipintr+0x1e
netintr() at netintr+0x64
softintr_dispatch() at softintr_dispatch+0x8b
Xsoftnet() at Xsoftnet+0x1f
--- interrupt ---
end trace frame: 0x0, count: 6
taskq_thread+0x6c:
Interesting, I have been trying to find and fix this bug for years.
We know that the pointers within the kernel are inconsistent when it
crashes. But it is unclear what caused the corruption.
Very specific setup, squid + squidGuard + pf (pf.conf attached).
It was working under OpenBSD 5.4.
I added this assertion in OpenBSD 5.5. The bug was there before,
but did not show up that clearly. Back then it panicked with a
use after free of the pcb.
----------------------------
revision 1.268
date: 2013/09/06 18:35:16; author: bluhm; state: Exp; lines: +3 -1;
In one core dump the pointers to socket, inpcb, tcpcb on the stack
of tcp_input() and tcp_output() were very inconsistent. Especially
the so->so_pcb is NULL which can only happen after the inp has been
detached. The whole issue looks similar to the old panic:
pool_do_get(inpcbpl): free list modified.
http://marc.info/?l=openbsd-bugs&m=132630237316970&w=2
To get more information, add some asserts that guarantee the
consistency of the socket, inpcb, tcpcb linking. They should trigger
when an inp is taken from the pcb hashes after it has been freed.
OK henning@
----------------------------
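To illustrate what such a consistency check tests, here is a minimal user-space sketch. It is not the kernel code: the structs are reduced to their link pointers, the field names inp_ppcb and t_inpcb are assumed to mirror the kernel's, and model_sotoinpcb(), pcb_links_consistent() and model_detach() are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins for the kernel structures. The real
 * struct socket, struct inpcb and struct tcpcb have many more fields.
 */
struct inpcb;
struct socket { struct inpcb *so_pcb; };
struct tcpcb  { struct inpcb *t_inpcb; };
struct inpcb  { struct socket *inp_socket; struct tcpcb *inp_ppcb; };

/* Model of sotoinpcb(): follow the socket's pointer back to its inpcb. */
static struct inpcb *
model_sotoinpcb(struct socket *so)
{
	return so->so_pcb;
}

/* Return 1 if socket, inpcb and tcpcb point at each other, else 0. */
static int
pcb_links_consistent(struct inpcb *inp)
{
	struct socket *so = inp->inp_socket;
	struct tcpcb *tp = inp->inp_ppcb;

	/* Same condition as the failed assertion in the panic message. */
	if (so == NULL || model_sotoinpcb(so) != inp)
		return 0;
	if (tp != NULL && tp->t_inpcb != inp)
		return 0;
	return 1;
}

/*
 * Hypothetical helper: mimic a detach that clears so->so_pcb while a
 * stale inp pointer is still held somewhere else.
 */
static void
model_detach(struct inpcb *inp)
{
	inp->inp_socket->so_pcb = NULL;
}
```

With fully linked structures pcb_links_consistent() returns 1. After model_detach() it returns 0, which corresponds to the so->so_pcb == NULL state described in the commit message.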
This squid proxy is a transparent proxy using squid and
squidguard. pf diverts packets to the lo interface.
This is similar to the setup of our customers where I saw the crash.
ddb{1}> mach ddbcpu 0x02
Stopped at      Debugger+0x9:   leave
Debugger() at Debugger+0x9
x86_ipi_handler() at x86_ipi_handler+0x76
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c
--- interrupt ---
__mp_lock() at __mp_lock+0x48
__mp_acquire_count() at __mp_acquire_count+0x2b
mi_switch() at mi_switch+0x21e
sleep_finish() at sleep_finish+0xb1
tsleep() at tsleep+0x154
kqueue_scan() at kqueue_scan+0x138
sys_kevent() at sys_kevent+0x282
syscall() at syscall+0x368
--- syscall (number 72) ---
end of kernel
end trace frame: 0x7f7ffffd1758, count: 4
0xa9630355e9a:
Another process is waiting for kqueue. Not surprising. I have also
seen this with select.
ddb{3}> mach ddbcpu 0x04
Stopped at      Debugger+0x9:   leave
Debugger() at Debugger+0x9
panic() at panic+0xfe
__assert() at __assert+0x25
tcp_input() at tcp_input+0x122c
ipv4_input() at ipv4_input+0x32e
ipintr() at ipintr+0x1e
netintr() at netintr+0x64
softintr_dispatch() at softintr_dispatch+0x8b
Xsoftnet() at Xsoftnet+0x1f
--- interrupt ---
end trace frame: 0x0, count: 6
taskq_thread+0x6c:
And that is the CPU where it panics.
pass quick proto carp keep state (no-sync)
pass quick on sync proto pfsync keep state (no-sync)
I have seen this on machines without carp and without pfsync. So
I think it is not related.
pass in log quick on proxy inet proto tcp from <lan_networks> to any port
80 route-to lo0 divert-to 127.0.0.1 port 3128
This seems to be the rule that diverts all the traffic and causes
the trouble.
Thanks for the bug report. I am sorry that I have no solution for
you. I will continue thinking about it.
As a workaround you could try the following diff. Normally a pf
state is used to find a socket. This is a speed optimization.
It is also necessary when using source transparent relays without
a divert-reply rule. Your setup should work without it, so try
this diff which disables it.
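The idea of the workaround can be sketched in user space as follows. This is only a model under stated assumptions, not the kernel code: struct pcb, struct pkt, table_lookup() and find_pcb() are hypothetical names, the cached pointer stands in for what pf_inp_lookup(m) returns, and the table search stands in for the regular pcb lookup after the findpcb: label.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct inpcb: only the local port matters here. */
struct pcb { int port; };

/* Stand-in for an mbuf whose pf state may carry a cached pcb pointer. */
struct pkt { int dport; struct pcb *cached; };

#define NPCB 4
static struct pcb table[NPCB] = { { 22 }, { 80 }, { 443 }, { 3128 } };

/* Slow path: search all pcbs by port (models the regular pcb lookup). */
static struct pcb *
table_lookup(int dport)
{
	int i;

	for (i = 0; i < NPCB; i++)
		if (table[i].port == dport)
			return &table[i];
	return NULL;
}

/*
 * Find the pcb for a packet. With use_cache != 0 the pointer cached
 * with the packet is trusted first (models "#if NPF > 0"); with
 * use_cache == 0 it is ignored (models the "#if 0" workaround).
 */
static struct pcb *
find_pcb(const struct pkt *p, int use_cache)
{
	struct pcb *inp = NULL;

	if (use_cache)
		inp = p->cached;
	if (inp == NULL)
		inp = table_lookup(p->dport);
	return inp;
}
```

In this model the fast path hands back whatever pointer is cached without rechecking it, while the slow path only returns entries that are still in the table. Whether a stale cached pointer is what actually happens in the kernel is not known; the diff is meant to help find out.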
I would be interested whether your setup still works without the
pf_inp_lookup(). And does this diff make the panic go away?
bluhm
Index: netinet/tcp_input.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.315
diff -u -p -r1.315 tcp_input.c
--- netinet/tcp_input.c 21 Mar 2016 15:52:27 -0000 1.315
+++ netinet/tcp_input.c 26 Mar 2016 17:14:56 -0000
@@ -579,7 +579,7 @@ tcp_input(struct mbuf *m, ...)
/*
* Locate pcb for segment.
*/
-#if NPF > 0
+#if 0
inp = pf_inp_lookup(m);
#endif
findpcb: