Re: openBGPd crashes in 6.2 and 6.3: "a politician in the decision process"

2018-09-10 Thread Daniel Stocker
Hi,

> We haven't got any dumps from when the problem happened. Would there
> still be a point now that the RS is stable?

i spoke with the Guys which caused the crash (https://communityrack.org) and 
asked them what they did exactly.

they said that they changed their AS Nr on their frr 5 router. after a short 
shut / no shut, the famous „fatal in RDE“ appeared.

Important detail. it happened after restarting the ipv6 session. Seems like 
ipv4 didn’t trigger the bug.

hope that helps.

cheers
stoege



openBGPd crashes in 6.2 and 6.3: "a politician in the decision process"

2018-09-10 Thread Pietro Stäheli
Hi Remi,

Sorry I didn't see your email sooner, Gmail decided to classify it as spam..

> I move this over to bugs@ which feels more appropriate to me. Chances that
> a bgpd developer sees it is higher.
> 
> Somehow there where two identical paths and bgpd could not figure out
> which one to prefer. The 13 steps bgpd goes through to find the best are
> listed in the bgpd man page.

That is the conclusion we arrived at as well.

> Do you have a pcap file with the bgp traffic or an mrt dump? (see dump in
> man bgpd.conf). This could answer the question why you had two identical
> path.
> 

We haven't got any dumps from when the problem happened. Would there
still be a point now that the RS is stable?

> Showing your config and your regular bgpctl commands could also help with
> analyzing the problem.

bgpctl command is as follows, it checks for the number of
non-established peers:

bgpctl show | grep ' 00:0' | grep -v "/" | wc -l


Slightly censored bgpd.conf:

https://pastebin.com/mqjpjHCx

On the other RS we added "announce restart no" to the configuration as
was advised earlier, just to see if that would be the difference in case
of another problem.

Best regards,
Pietro



Re: openBGPd crashes in 6.2 and 6.3: "a politician in the decision process"

2018-08-24 Thread Remi Locherer
Hi Pietro,

On Thu, Aug 23, 2018 at 10:05:30AM +0200, Pietro Stäheli wrote:
> Hi,
> 
> openBGPd is running at an internet exchange, two openBSD route servers
> (rs3 on openBSD 6.3 and rs4 on openBSD 6.2, both virtual machines on
> different hypervisors in different locations) connect with peering
> customers.
> 
> We've experienced crashes in openBGPd twice in the past two weeks. Both
> times with the same error message: "fatal in RDE: Uh, oh a politician in
> the decision process". These error messages are logged on both route
> servers right before they crash within seconds of each other.
> 
> The route servers had been running quite reliably for a long time before
> the recent incidents. The daemon can then be restarted without an issue.
> CPU usage prior to the crash is minimal (<5%).
> 
> In the minutes before the crash we're seeing error messages like the
> following in daemon.log:
> 
> bgpd[23099]: no such peer: id=4294967037
> 
> 
> Sample of logs just before the crash:
> 
> Aug 22 15:38:58 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 81.163.124.0/24 via 91.206.52.170
> Aug 22 15:38:58 rs3 bgpd[23099]: no such peer: id=4294967037
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c:8000::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c:8000::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:39:00 rs3 bgpd[23099]: Connection attempt from neighbor
> 91.206.52.139 while session is in state Idle
> Aug 22 15:39:01 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.96
> AS31042: update 185.64.172.0/24 via 91.206.52.96
> Aug 22 15:39:01 rs3 bgpd[43763]: fatal in RDE: Uh, oh a politician in
> the decision process
> Aug 22 15:39:01 rs3 bgpd[99961]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[99961]: main: Lost connection to RDE
> Aug 22 15:39:01 rs3 bgpd[23099]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[23099]: SE: Lost connection to parent
> 
> 
> Logs just before the "no such peer" messages appear:
> 
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.54
> AS34554: update 80.75.112.0/20 via 91.206.52.54
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::36
> AS34554: update 2a01:6a8::/32 via 2001:7f8:24::36
> Aug 22 15:36:44 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: graceful restart
> of IPv4 unicast, keeping routes
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: state change
> Established -> Idle, reason: Connection closed
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: removed
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:36:54 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:55 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 197.249.160.0/19 via 91.206.52.170
> Aug 22 15:36:55 rs3 bgpd[23099]: no such peer: id=4294967037
> 
> 
> 
> I haven't found much about the error message apart from this mailing
> list thread: https://www.mail-archive.com/misc@openbsd.org/msg04565.html
> 
> The thread suggests that invoking bgpctl may cause the failure. 'bgpctl
> show' is invoked every few minutes through our monitoring system to
> check on the status of peer connections.
> 
> Can anybody point me to a possible cause or troubleshooting regarding
> this issue? Could a misconfigured/broken peer be the cause?  Has anybody
> dealt with a similar problem?
> 
> I can provide bgpd.conf and full logs of both incidents if necessary.

I move this over to bugs@ which feels more appropriate to me. Chances that
a bgpd developer sees it is higher.

Somehow there where two identical paths and bgpd could not figure out