Hi,

openBGPd is running at an internet exchange, two openBSD route servers
(rs3 on openBSD 6.3 and rs4 on openBSD 6.2, both virtual machines on
different hypervisors in different locations) connect with peering
customers.

We've experienced crashes in openBGPd twice in the past two weeks. Both
times with the same error message: "fatal in RDE: Uh, oh a politician in
the decision process". These error messages are logged on both route
servers right before they crash within seconds of each other.

The route servers had been running quite reliably for a long time before
the recent incidents. The daemon can then be restarted without an issue.
CPU usage prior to the crash is minimal (<5%).

In the minutes before the crash we're seeing error messages like the
following in daemon.log:

bgpd[23099]: no such peer: id=4294967037


Sample of logs just before the crash:

Aug 22 15:38:58 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
AS6939: update 81.163.124.0/24 via 91.206.52.170
Aug 22 15:38:58 rs3 bgpd[23099]: no such peer: id=4294967037
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
AS31424: withdraw 2a01:6a8::/32
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: withdraw 2a01:6a8::/32
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
AS6939: withdraw 2804:364c::/33
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
AS6939: withdraw 2804:364c:8000::/33
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
AS6939: update 2804:364c::/33 via 2001:7f8:24::aa
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
AS6939: update 2804:364c:8000::/33 via 2001:7f8:24::aa
Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
Aug 22 15:39:00 rs3 bgpd[23099]: Connection attempt from neighbor
91.206.52.139 while session is in state Idle
Aug 22 15:39:01 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.96
AS31042: update 185.64.172.0/24 via 91.206.52.96
Aug 22 15:39:01 rs3 bgpd[43763]: fatal in RDE: Uh, oh a politician in
the decision process
Aug 22 15:39:01 rs3 bgpd[99961]: peer closed imsg connection
Aug 22 15:39:01 rs3 bgpd[99961]: main: Lost connection to RDE
Aug 22 15:39:01 rs3 bgpd[23099]: peer closed imsg connection
Aug 22 15:39:01 rs3 bgpd[23099]: SE: Lost connection to parent


Logs just before the "no such peer" messages appear:

Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.54
AS34554: update 80.75.112.0/20 via 91.206.52.54
Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::36
AS34554: update 2a01:6a8::/32 via 2001:7f8:24::36
Aug 22 15:36:44 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: graceful restart
of IPv4 unicast, keeping routes
Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: state change
Established -> Idle, reason: Connection closed
Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: removed
Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
AS31424: withdraw 2a01:6a8::/32
Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: withdraw 2a01:6a8::/32
Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
Aug 22 15:36:54 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
Aug 22 15:36:55 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
AS6939: update 197.249.160.0/19 via 91.206.52.170
Aug 22 15:36:55 rs3 bgpd[23099]: no such peer: id=4294967037



I haven't found much about the error message apart from this mailing
list thread: https://www.mail-archive.com/misc@openbsd.org/msg04565.html

The thread suggests that invoking bgpctl may cause the failure. 'bgpctl
show' is invoked every few minutes through our monitoring system to
check on the status of peer connections.

Can anybody point me to a possible cause or troubleshooting regarding
this issue? Could a misconfigured/broken peer be the cause?  Has anybody
dealt with a similar problem?

I can provide bgpd.conf and full logs of both incidents if necessary.


Best regards,
Pietro Stäheli

Reply via email to