Hi BIRD users,

Does anyone know whether a BGP shared secret can be rotated without incurring 
any network downtime? I did some testing with BIRD's BGP password option, and 
it appears that any change to the BGP password configuration causes a brief 
network outage affecting both existing and new connections. It looks as though 
BIRD restarts the affected BGP protocols and pulls down the routes learned from 
those peers immediately, rather than letting them live according to the 
timeout setting. Does BIRD flush all routes it has learned from a peer when 
this configuration changes? Here is a brief excerpt to demonstrate the outage. 
Note that the network disruption coincides, to the second, with the timestamp 
at which BIRD is reconfigured:

# Logs from calico-node-4w4bv (10.95.14.104)
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 1415: Trigger to recheck BGP peers following possible password update
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 250: Recompute v1 BGP peerings
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 949: Recompute BGP peerings:
bird: Reconfiguration requested by SIGHUP
2022-06-29 18:26:18.844 [INFO][60] confd/resource.go 278: Target config /etc/calico/confd/config/bird.cfg has been updated due to change in key: /calico/bgp/v1/global
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Restarting protocol Mesh_10_95_14_105
bird: Mesh_10_95_14_105: Shutting down
bird: Mesh_10_95_14_105: State changed to stop
bird: Restarting protocol Mesh_10_95_14_110
bird: Mesh_10_95_14_110: Shutting down
bird: Mesh_10_95_14_110: State changed to stop
bird: Mesh_10_95_14_105: State changed to down
bird: Mesh_10_95_14_105: Initializing
bird: Mesh_10_95_14_105: Starting
bird: Mesh_10_95_14_105: State changed to start
bird: Mesh_10_95_14_110: State changed to down
bird: Mesh_10_95_14_110: Initializing
bird: Mesh_10_95_14_110: Starting
bird: Mesh_10_95_14_110: State changed to start
bird: Reconfigured
bird: Mesh_10_95_14_110: Connected to table master
bird: Mesh_10_95_14_110: State changed to feed
bird: Mesh_10_95_14_110: State changed to up
bird: Mesh_10_95_14_105: Connected to table master
bird: Mesh_10_95_14_105: State changed to feed
bird: Mesh_10_95_14_105: State changed to up
2022-06-29 18:26:36.079 [INFO][58] monitor-addresses/autodetection_methods.go 117: Using autodetected IPv4 address 10.95.14.104/26 on matching interface eth0
2022-06-29 18:26:54.537 [INFO][62] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.1s: avg=5ms longest=21ms (resync-filter-v4,resync-nat-v4)

# Connection Tester Daemonset (hitting an echoserver twice a second or so)

Wed Jun 29 18:26:17 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:17 UTC 2022

Wed Jun 29 18:26:18 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:18 UTC 2022

Wed Jun 29 18:26:18 UTC 2022
curl: (28) Connection timed out after 300 milliseconds
Failed to connect to echo server
Wed Jun 29 18:26:19 UTC 2022

Wed Jun 29 18:26:19 UTC 2022
curl: (28) Connection timed out after 300 milliseconds
Failed to connect to echo server
Wed Jun 29 18:26:20 UTC 2022

Wed Jun 29 18:26:20 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:20 UTC 2022
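
For anyone who wants to line the failed probes up against the BIRD log 
mechanically, here is a rough Python sketch. It assumes each probe prints a 
start timestamp, the result, and an end timestamp, with probes separated by a 
blank line, as in the excerpt above. The function name failure_times is my own 
and is not part of the tester.

```python
import re
from datetime import datetime

# Timestamp format printed by `date` in the tester output above.
DATE_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def failure_times(tester_output):
    """Return the start timestamp of every failed probe (hypothetical helper)."""
    failures = []
    # Probes are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", tester_output.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if any("Failed to connect" in line for line in lines):
            # The first line of each block is assumed to be the start time.
            failures.append(datetime.strptime(lines[0], DATE_FORMAT))
    return failures


if __name__ == "__main__":
    sample = """Wed Jun 29 18:26:18 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:18 UTC 2022

Wed Jun 29 18:26:18 UTC 2022
curl: (28) Connection timed out after 300 milliseconds
Failed to connect to echo server
Wed Jun 29 18:26:19 UTC 2022
"""
    for ts in failure_times(sample):
        print("probe failed at", ts.isoformat())
```

The resulting timestamps can then be compared by eye with the 
"Reconfiguration requested by SIGHUP" line in the BIRD log.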

# For peer /host/10.95.14.81/ip_addr_v4
protocol bgp Mesh_10_95_14_81 from bgp_template {
  neighbor 10.95.14.81 as 64512;
  source address 10.95.14.82;  # The local address we use for the TCP connection
  graceful restart time 1800;  # This parameter seems to make no difference when changing BGP passwords
  password "LJiKASiglY+KafEwEn/cSmkiok0zHgpQq5EtYhYgoDcSQwKIpX22Tz7jOzX+";
}


I have read the RFCs for both BGP Graceful Restart (RFC 4724) and TCP MD5 
protection of BGP sessions (RFC 2385) but haven't found a solid answer yet. 
When the password is changed, it makes complete sense that a peer with the new 
password will refuse to accept any NEW routes from peers still using the old 
password, and vice versa. What I don't see is a fundamental reason why TCP 
segments arriving with an unexpected hash require previously learned routes 
from that peer to be flushed immediately, but the observed outage suggests 
that is what is happening. One would think that, in principle, BIRD could wait 
to tear down existing routes until a configurable timeout (say, the graceful 
restart time) expires, providing a window in which we can change the password 
and maintain stable routing.
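
To make the behaviour I have in mind concrete, here is a toy Python model. It 
is purely illustrative and is not how BIRD works internally: the Peer class, 
its hold_time field, and the 192.0.2.0/24 prefix are all made up for the 
example. Routes from a peer are kept after the session drops and are only 
flushed once the timer runs out.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Peer:
    """Toy model of routes learned from one BGP peer (hypothetical, not BIRD code)."""

    hold_time: float  # seconds to keep routes after the session is lost
    routes: Set[str] = field(default_factory=set)
    lost_at: Optional[float] = None  # time the session dropped, if it has

    def session_lost(self, now: float) -> None:
        # Keep the routes, but remember when the session went away.
        self.lost_at = now

    def session_restored(self, fresh_routes: Set[str]) -> None:
        # The peer is back (e.g. with the new password): replace retained routes.
        self.routes = set(fresh_routes)
        self.lost_at = None

    def usable_routes(self, now: float) -> Set[str]:
        # Retained routes stay usable until the hold timer expires.
        if self.lost_at is not None and now - self.lost_at >= self.hold_time:
            self.routes.clear()
        return set(self.routes)


if __name__ == "__main__":
    peer = Peer(hold_time=1800.0, routes={"192.0.2.0/24"})
    peer.session_lost(now=0.0)             # password changed, session drops
    print(peer.usable_routes(now=2.0))     # routes still retained here
    peer.session_restored({"192.0.2.0/24"})
    print(peer.usable_routes(now=3600.0))  # session is up, nothing expires
```

With something like this, a password change that completes within the hold 
time would never leave the routing table without a route to the peer.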

I am relatively new to BGP and am using BIRD indirectly via Calico for 
container networking inside Kubernetes. I will of course take this up with the 
Calico maintainers, but is there anything in the BGP spec or the BIRD 
implementation that fundamentally prevents disruption-free secret rotation? 
Let me know if there is any place I should look for more information on this, 
or any debug logs that would be helpful.

Thanks,
Calvin
