On Tue, Oct 29, 2019 at 01:41:18PM +0000, Dmitry Vovkula wrote:
> Hello.
> 
> We found the following problem: when the communication channel between 
> routers (running OpenBSD 6.6) becomes unavailable, the BGP sessions drop 
> after the BGP hold timer expires and then the bgpd daemon stops on all 
> routers. The situation also repeats when two other routers that are located 
> at the same site are unavailable.
> If I run the command to check the BGP neighbors, I get a response showing 
> that the bgpd daemon is not running:
> bgpctl show summary
> bgpctl: connect: /var/run/bgpd.sock.0: Connection refused
> 
> After I start the bgpd daemon (rcctl start bgpd), the sessions come back up.
> 
> We have the following topology. Routers are indicated on the topology as 
> 1, 2, 3, 4, 5, 6. All routers run the OpenBSD 6.6 operating system.
> 
> 1---------------3
> | \           / |
> |  2---------4  |
> |   \       /   |
> |    ---6---    |
> |       |       |
> +-------5-------+
> 
> Routers are connected by GRE interfaces: 1 and 3, 1 and 5, 3 and 5, 2 and 4, 
> 2 and 6, 4 and 6.
> Routers are connected by vlan interfaces: 1 and 2, 3 and 4, 5 and 6.
> Routers that are installed at the same site: 1 and 2, 3 and 4, 5 and 6.
> On all routers the bird daemon runs OSPF on the GRE, vlan, and loopback 
> interfaces to provide reachability of the loopback interfaces.
> The ldpd daemon runs on the GRE and vlan interfaces to generate MPLS 
> labels.
> The bgpd daemon runs on the loopback interfaces for MPLS L3 VPN.
> Routers 2, 4, 6 are configured as BGP route reflectors and form one 
> cluster with the same cluster ID.
> 
> In the logs, we observe the following messages (these logs are from router 6):
> 
> Oct 24 13:53:02 router6 bgpd[22410]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[22410]: main: Lost connection to RDE
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[22410]: kernel routing table 99 (VRF_99) 
> decoupled
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to RDE
> Oct 24 13:53:02 router6 bgpd[22410]: kernel routing table 0 (Loc-RIB) 
> decoupled
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to RDE control
> Oct 24 13:53:02 router6 bgpd[47521]: Rib Loc-RIB: neighbor 10.0.0.6 (LOCAL) 
> AS64500: withdraw announce 192.168.51.0/24
> Oct 24 13:53:02 router6 bgpd[68713]: peer closed imsg connection
> Oct 24 13:53:02 router6 bgpd[68713]: SE: Lost connection to parent
> Oct 24 13:53:02 router6 bgpd[47521]: Rib Loc-RIB: neighbor 10.0.0.6 (LOCAL) 
> AS64500: withdraw announce 172.16.17.0/24
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.2 : sending 
> notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[22410]: route decision engine terminated; signal 
> 10
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.2 : state change 
> Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.1 : sending 
> notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.1 : state change 
> Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.3 : sending 
> notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.3 : state change 
> Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.5 : sending 
> notification: Cease, administratively down
> Oct 24 13:53:02 router6 bgpd[68713]: neighbor 10.0.0.5 : state change 
> Established -> Idle, reason: Stop
> Oct 24 13:53:02 router6 bgpd[68713]: session engine exiting
> Oct 24 13:53:02 router6 bgpd[22410]: terminating
> 
> Example bgp configuration of router 1. The configuration of BGP routers 3 and 
> 5 is similar to the configuration of router 1.
> 
> AS 64500
> router-id 10.0.0.1
> holdtime 60
> connect-retry 1
> log updates
> listen on 10.0.0.1
> 
> vpn "VRF_99" on mpe99 {
>         rd 64500:99
>         export-target rt 64500:99
>         import-target rt 64500:99
>         network inet priority 25 set metric 30
> }
> 
> group "RR-servers" {
>         remote-as 64500
>         local-address 10.0.0.1
>         announce IPv4 unicast
>         announce IPv4 vpn
> 
>         neighbor 10.0.0.2 {
> 
>         }
>         neighbor 10.0.0.4 {
> 
>         }
>         neighbor 10.0.0.6 {
> 
>         }
> }
> 
> allow from any
> allow to any
> 
> Example bgp configuration of router 2. The configuration of BGP routers 4 and 
> 6 is similar to the configuration of router 2.
> 
> AS 64500
> router-id 10.0.0.2
> holdtime 60
> connect-retry 1
> log updates
> listen on 10.0.0.2
> 
> vpn "VRF_99" on mpe99 {
>         rd 64500:99
>         export-target rt 64500:99
>         import-target rt 64500:99
>         network inet priority 25 set metric 50
> }
> 
> group "RR-servers" {
>         remote-as 64500
>         local-address 10.0.0.2
>         announce IPv4 unicast
>         announce IPv4 vpn
> 
>         neighbor 10.0.0.4 {
> 
>         }
>         neighbor 10.0.0.6 {
> 
>         }
> }
> 
> group "RR-clients" {
>         remote-as 64500
>         local-address 10.0.0.2
>         announce IPv4 unicast
>         announce IPv4 vpn
>         route-reflector 10.0.0.0
> 
>         neighbor 10.0.0.1 {
> 
>         }
>         neighbor 10.0.0.3 {
> 
>         }
>         neighbor 10.0.0.5 {
> 
>         }
> }
> 
> allow from any
> allow to any
> 
> On the previous version, OpenBSD 6.5, there was no such problem.
> Can you tell me how to fix the problem?
> 

A fix for this crash went into -current today, and a syspatch with this
and other fixes will be provided once we're confident that it indeed
fixes all problems currently seen.

I have included the patch for your issue below.
-- 
:wq Claudio

Index: rde_rib.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_rib.c,v
retrieving revision 1.207
diff -u -p -r1.207 rde_rib.c
--- rde_rib.c   27 Sep 2019 14:50:39 -0000      1.207
+++ rde_rib.c   28 Oct 2019 08:59:51 -0000
@@ -1777,13 +1777,15 @@ nexthop_update(struct kroute_nexthop *ms
                if (nexthop_unref(nh))
                        return;         /* nh lost last ref, no work left */
 
-       if (nh->next_prefix)
+       if (nh->next_prefix) {
                /*
                 * If nexthop_runner() is not finished with this nexthop
                 * then ensure that all prefixes are updated by setting
                 * the oldstate to NEXTHOP_FLAPPED.
                 */
                nh->oldstate = NEXTHOP_FLAPPED;
+               TAILQ_REMOVE(&nexthop_runners, nh, runner_l);
+       }
 
        if (msg->connected) {
                nh->flags |= NEXTHOP_CONNECTED;
@@ -1855,8 +1857,12 @@ nexthop_unlink(struct prefix *p)
        if (p->nexthop == NULL || (p->flags & PREFIX_NEXTHOP_LINKED) == 0)
                return;
 
-       if (p == p->nexthop->next_prefix)
+       if (p == p->nexthop->next_prefix) {
                p->nexthop->next_prefix = LIST_NEXT(p, entry.list.nexthop);
+               /* remove nexthop from list if no prefixes left to update */
+               if (p->nexthop->next_prefix == NULL)
+                       TAILQ_REMOVE(&nexthop_runners, p->nexthop, runner_l);
+       }
 
        p->flags &= ~PREFIX_NEXTHOP_LINKED;
        LIST_REMOVE(p, entry.list.nexthop);
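
For readers who want to see the idea outside of bgpd: the two hunks above
appear to keep one invariant, namely that a nexthop sits on the
nexthop_runners queue exactly once, and only while it still has prefixes
left to update. The following is a minimal standalone sketch of that
invariant, not the bgpd code. struct nh, the pending flag (a stand-in for
next_prefix != NULL) and the nh_* functions are hypothetical names made up
for illustration; only the TAILQ macros from <sys/queue.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a nexthop; not bgpd's struct nexthop. */
struct nh {
	TAILQ_ENTRY(nh)	 runner_l;	/* linkage on the work queue */
	int		 pending;	/* stand-in for next_prefix != NULL */
	int		 id;
};

TAILQ_HEAD(nh_queue, nh);

/*
 * Queue a nexthop for processing. If it is already queued (it flapped
 * before the runner finished with it), unlink it first so that it is
 * never linked into the queue twice.
 */
void
nh_schedule(struct nh_queue *q, struct nh *n)
{
	if (n->pending)
		TAILQ_REMOVE(q, n, runner_l);
	n->pending = 1;
	TAILQ_INSERT_HEAD(q, n, runner_l);
}

/*
 * No work left for this nexthop: take it off the queue so that a later
 * nh_schedule() starts from a clean state.
 */
void
nh_done(struct nh_queue *q, struct nh *n)
{
	if (!n->pending)
		return;
	n->pending = 0;
	TAILQ_REMOVE(q, n, runner_l);
}

/* Walk the queue and count the entries. */
int
nh_queue_len(struct nh_queue *q)
{
	struct nh *n;
	int len = 0;

	TAILQ_FOREACH(n, q, runner_l)
		len++;
	return (len);
}
```

Scheduling the same element twice without the TAILQ_REMOVE() in
nh_schedule() would link it into the queue a second time, which leaves the
list pointers inconsistent. With the removal in place, scheduling a, b and
then a again leaves two entries on the queue, with a at the head.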
