On Tue, Oct 29, 2019 at 01:41:18PM +0000, Dmitry Vovkula wrote:
> Hello.
>
> We found the following problem: when connectivity between routers (all
> running OpenBSD 6.6) is lost because of a problem with the communication
> channel, the BGP sessions drop once the hold timer expires, and then the
> bgpd daemon stops on all routers. The same thing happens when two other
> routers located at the same site become unreachable.
> If I run the command to check the BGP neighbors, the response shows that
> the bgpd daemon is not running:
> bgpctl show summary
> bgpctl: connect: /var/run/bgpd.sock.0: Connection refused
>
> After I start the bgpd daemon with "rcctl start bgpd", the sessions come
> back up.
>
> We have the following topology. The routers are labeled 1, 2, 3, 4, 5,
> and 6 in the diagram. All of them run OpenBSD 6.6.
>
> 1---------------3
> |\             /|
> | 2-----------4 |
> |  \         /  |
> |   ----6----   |
> |       |       |
> +-------5-------+
>
> GRE interfaces connect routers 1 and 3, 1 and 5, 3 and 5, 2 and 4,
> 2 and 6, 4 and 6.
> vlan interfaces connect routers 1 and 2, 3 and 4, 5 and 6.
> Each of these pairs shares a site: 1 and 2, 3 and 4, 5 and 6.
> All routers run the bird daemon with OSPF on the gre, vlan, and
> loopback interfaces to make the loopback addresses reachable.
> The ldpd daemon runs on the gre and vlan interfaces to generate MPLS
> labels.
> The bgpd daemon runs on the loopback addresses for MPLS L3 VPN.
> Routers 2, 4, and 6 are configured as BGP route reflectors and form one
> cluster with the same cluster ID.
>
> In the logs we see the following messages (these are from router 6):
>
> Oct 24 13:53:02 router6 bgpd[22410]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[22410]: main: Lost connection to RDE
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[22410]: kernel routing table 99 (VRF_99) decoupled
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to RDE
> Oct 24 13:53:02 router6 bgpd[22410]: kernel routing table 0 (Loc-RIB) decoupled
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to RDE control
> Oct 24 13:53:02 router6 bgpd[47521]: Rib Loc-RIB: neighbor 10.0.0.6 (LOCAL) AS64500: withdraw announce 192.168.51.0/24
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to parent
> Oct 24 13:53:02 router6 bgpd[47521]: Rib Loc-RIB: neighbor 10.0.0.6 (LOCAL) AS64500: withdraw announce 172.16.17.0/24
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.2 : sending notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[22410]: route decision engine terminated; signal 10
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.2 : state change Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.1 : sending notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.1 : state change Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.3 : sending notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.3 : state change Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.5 : sending notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.5 : state change Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: session engine exiting
> Oct 24 13:53:02 router6 bgpd[22410]: terminating
>
> Example bgpd configuration for router 1. Routers 3 and 5 are configured
> in the same way.
>
> AS 64500
> router-id 10.0.0.1
> holdtime 60
> connect-retry 1
> log updates
> listen on 10.0.0.1
>
> vpn "VRF_99" on mpe99 {
> 	rd 64500:99
> 	export-target rt 64500:99
> 	import-target rt 64500:99
> 	network inet priority 25 set metric 30
> }
>
> group "RR-servers" {
> 	remote-as 64500
> 	local-address 10.0.0.1
> 	announce IPv4 unicast
> 	announce IPv4 vpn
>
> 	neighbor 10.0.0.2 {
>
> 	}
> 	neighbor 10.0.0.4 {
>
> 	}
> 	neighbor 10.0.0.6 {
>
> 	}
> }
>
> allow from any
> allow to any
>
> Example bgpd configuration for router 2. Routers 4 and 6 are configured
> in the same way.
>
> AS 64500
> router-id 10.0.0.2
> holdtime 60
> connect-retry 1
> log updates
> listen on 10.0.0.2
>
> vpn "VRF_99" on mpe99 {
> 	rd 64500:99
> 	export-target rt 64500:99
> 	import-target rt 64500:99
> 	network inet priority 25 set metric 50
> }
>
> group "RR-servers" {
> 	remote-as 64500
> 	local-address 10.0.0.2
> 	announce IPv4 unicast
> 	announce IPv4 vpn
>
> 	neighbor 10.0.0.4 {
>
> 	}
> 	neighbor 10.0.0.6 {
>
> 	}
> }
>
> group "RR-clients" {
> 	remote-as 64500
> 	local-address 10.0.0.2
> 	announce IPv4 unicast
> 	announce IPv4 vpn
> 	route-reflector 10.0.0.0
>
> 	neighbor 10.0.0.1 {
>
> 	}
> 	neighbor 10.0.0.3 {
>
> 	}
> 	neighbor 10.0.0.5 {
>
> 	}
> }
>
> allow from any
> allow to any
>
> On the previous release, OpenBSD 6.5, there was no such problem.
> Can you tell me how to fix it?
>
A fix for this crash went into -current today, and a syspatch with this
and other fixes will be provided once we're confident that it indeed
fixes all the problems currently seen.
I have included the patch for your issue below.
--
:wq Claudio
Index: rde_rib.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_rib.c,v
retrieving revision 1.207
diff -u -p -r1.207 rde_rib.c
--- rde_rib.c 27 Sep 2019 14:50:39 -0000 1.207
+++ rde_rib.c 28 Oct 2019 08:59:51 -0000
@@ -1777,13 +1777,15 @@ nexthop_update(struct kroute_nexthop *ms
 	if (nexthop_unref(nh))
 		return;		/* nh lost last ref, no work left */
 
-	if (nh->next_prefix)
+	if (nh->next_prefix) {
 		/*
 		 * If nexthop_runner() is not finished with this nexthop
 		 * then ensure that all prefixes are updated by setting
 		 * the oldstate to NEXTHOP_FLAPPED.
 		 */
 		nh->oldstate = NEXTHOP_FLAPPED;
+		TAILQ_REMOVE(&nexthop_runners, nh, runner_l);
+	}
 
 	if (msg->connected) {
 		nh->flags |= NEXTHOP_CONNECTED;
@@ -1855,8 +1857,12 @@ nexthop_unlink(struct prefix *p)
 	if (p->nexthop == NULL || (p->flags & PREFIX_NEXTHOP_LINKED) == 0)
 		return;
 
-	if (p == p->nexthop->next_prefix)
+	if (p == p->nexthop->next_prefix) {
 		p->nexthop->next_prefix = LIST_NEXT(p, entry.list.nexthop);
+		/* remove nexthop from list if no prefixes left to update */
+		if (p->nexthop->next_prefix == NULL)
+			TAILQ_REMOVE(&nexthop_runners, p->nexthop, runner_l);
+	}
 
 	p->flags &= ~PREFIX_NEXTHOP_LINKED;
 	LIST_REMOVE(p, entry.list.nexthop);