On Mon, Feb 18, 2013 at 15:45 +0100, Michael wrote:
> Hi all,
>
> after having a somewhat weird problem for a while now I hope someone can
> help me. _Sorry_ for the really lengthy mail but it is kind of complex
> to describe.
>
> dmesg and other information can be found at the end.
>
> The problem in short:
> Server keeps crashing hard (even ddb won't respond) after a more or less
> random time when using a GRE tunnel inside IPsec (transport mode).
>
> Elaboration:
> The setup consists of 3 OpenBSD boxes, one running OpenBSD 5.1 and the
> other two OpenBSD 5.2 (upgrading from 5.1 to 5.2 didn't fix the issue).
> Each box is directly connected to the internet with a public IP in a
> different physical location (1x US, 2x DE).
>
> All 3 boxes are connected with an IPsec tunnel, like a triangle (a<->b,
> b<->c, a<->c). Inside the IPsec tunnel is a GRE tunnel with OSPF on top
> for dynamic routing.
>
> Two of those systems got a softraid0 crypto partition running, the third
> one doesn't. (More on why that might be important later).
>
> When all 3 boxes are powered up everything is working perfectly fine,
> but after some random interval (can be minutes, can be days) one or two
> of the boxes crash, showing the ddb console but not letting me type
> anything in.
>
> When the 2 boxes are rebooted, the game starts anew.
>
> In case only one of the boxes crashed, I can by now predict a 99% chance
> that the second one will crash shortly after the first one was fully
> rebooted.
>
> Now, it is only ever 1 or 2 boxes that crash and so far it never has
> been the box WITHOUT the softraid0 crypto volume.
>
> Out of curiosity I also created a crypto volume on the third box and put
> it to some use (squid cache partition) and sure enough, now the third box
> sometimes crashes too.
>
> When doing some tests (with only having two systems using a crypto
> partition) I also noticed that there are no crashes at all if there is
> only a single IPsec tunnel active between two of the boxes (one box with
> crypto partition, the other without) and GRE encrypted inside and the
> other GRE tunnels are unencrypted.
>
> To not play around too much with the production systems I tried to
> replicate the issue with 3 VirtualBox VMs and the latest OpenBSD 5.3
> snapshot, but VirtualBox instantly throws a GURU MEDITATION ERROR
> whenever I try to push a file (1 MB is enough) over an encrypted GRE
> tunnel using scp or netcat from one machine to the other. When I turn off
> IPsec, the transfer works, no crashing.
>
> I only have console access to two of the boxes and whenever a system
> crashes, it displays a very short message which is always a little
> different, the only consistent part is the mentioning of "Stopped at
> __mp_lock". That system is running OpenBSD 5.1, bsd.mp.
>
> I hope someone has an idea what might be going on.
>
> Thanks,
> Michael
>
> PS: If someone wants to play around with my three VirtualBox test VMs
> you can download them here:
>
> https://ssl.bsdhost.eu/owncloud/public.php?service=files&t=17b7472546546a617a3358ef9d953a4c
>
hi,
there appears to be some spls missing in net/if_gre.c code.
netinet/ip_gre.c looks almost sane (gre_input should be
called at splsoftnet from ip_input), yet gre_usrreq calls
rip_usrreq that doesn't do splsoftnet itself (yuck!).
reminds me of the recent raw_usrreq change. anyways, both
diffs are attached. please try them and see if they help
out.
cheers
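
For readers unfamiliar with the spl pattern the diffs below rely on, here is a
minimal userspace sketch. Everything prefixed with mock_ or MOCK_ is
hypothetical and made up for illustration; it is not the kernel's
implementation, only the raise/restore discipline: save the old level when
raising, and restore it on every exit path, including early returns.

```c
#include <assert.h>

/* Hypothetical userspace mock of interrupt priority levels; not kernel code. */
enum { MOCK_IPL_NONE = 0, MOCK_IPL_SOFTNET = 1, MOCK_IPL_NET = 2 };

static int mock_cpl = MOCK_IPL_NONE;	/* current (mock) priority level */
static int mock_seen_level = -1;	/* level observed by mock_link_state() */

/* Raise the level (never lower it) and return the previous one. */
static int
mock_splraise(int level)
{
	int old = mock_cpl;

	if (level > mock_cpl)
		mock_cpl = level;
	return (old);
}

/* Restore a level previously returned by mock_splraise(). */
static void
mock_splx(int old)
{
	assert(old >= MOCK_IPL_NONE && old <= MOCK_IPL_NET);
	mock_cpl = old;
}

/* Stand-in for a routine that expects to run at a raised level. */
static void
mock_link_state(void)
{
	mock_seen_level = mock_cpl;
}

/* Timeout-style caller: raise, call, restore. */
static void
mock_keepalive(void)
{
	int s;

	s = mock_splraise(MOCK_IPL_NET);
	mock_link_state();
	mock_splx(s);
}

/* Request-handler-style caller with an early return that must also restore. */
static int
mock_usrreq(int early)
{
	int s, error = 0;

	s = mock_splraise(MOCK_IPL_SOFTNET);
	if (early) {
		mock_splx(s);	/* easy to forget on this path */
		return (0);
	}
	mock_splx(s);
	return (error);
}
```

Calling mock_keepalive() and then mock_usrreq(1) should leave mock_cpl back at
MOCK_IPL_NONE; dropping the mock_splx() on the early-return path would leave
the level raised, which is the kind of mistake the extra splx(s) before
return (0) in the raw_ip.c diff guards against.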
diff --git sys/net/if_gre.c sys/net/if_gre.c
index 7a9eeee..84f0f0e 100644
--- sys/net/if_gre.c
+++ sys/net/if_gre.c
@@ -679,12 +679,15 @@ void
gre_keepalive(void *arg)
{
struct gre_softc *sc = arg;
+ int s;
if (!sc->sc_ka_timout)
return;
sc->sc_ka_state = GRE_STATE_DOWN;
+ s = splnet();
gre_link_state(sc);
+ splx(s);
}
void
@@ -747,6 +750,8 @@ gre_send_keepalive(void *arg)
void
gre_recv_keepalive(struct gre_softc *sc)
{
+ int s;
+
if (!sc->sc_ka_timout)
return;
@@ -762,7 +767,9 @@ gre_recv_keepalive(struct gre_softc *sc)
case GRE_STATE_HOLD:
if (--sc->sc_ka_holdcnt < 1) {
sc->sc_ka_state = GRE_STATE_UP;
+ s = splnet();
gre_link_state(sc);
+ splx(s);
}
break;
case GRE_STATE_UP:
diff --git sys/netinet/raw_ip.c sys/netinet/raw_ip.c
index 61285a8..050529f 100644
--- sys/netinet/raw_ip.c
+++ sys/netinet/raw_ip.c
@@ -396,7 +396,7 @@ int
rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
struct mbuf *control, struct proc *p)
{
- int error = 0;
+ int s, error = 0;
struct inpcb *inp = sotoinpcb(so);
#ifdef MROUTING
extern struct socket *ip_mrouter;
@@ -410,6 +410,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
goto release;
}
+ s = splsoftnet();
switch (req) {
case PRU_ATTACH:
@@ -532,6 +533,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
/*
* stat: don't bother with a blocksize.
*/
+ splx(s);
return (0);
/*
@@ -556,6 +558,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
default:
panic("rip_usrreq");
}
+ splx(s);
release:
if (m != NULL)
m_freem(m);