Hi Timo,

We (Cumulus) have seen a very similar crash (with same stack trace) in a couple 
of different situations. For one scenario, we have already submitted a patch 
which is up on the Cumulus quagga github tree and will hopefully be rolled in 
soon. For the second scenario, we have a patch that is currently in test and 
should be pushed by us soon. In addition, there is a third scenario that we 
uncovered recently that may potentially lead to this situation too, though we 
haven't actually seen this manifestation. A fix for this is also in test and 
should be part of the next set of patches that we push.

None of these 3 patches has anything to do with any other Cumulus patches.

a) bgpd-gr-route-selection-fix.patch

This is the patch that is up on our tree and can be viewed at 
https://github.com/CumulusNetworks/quagga/commit/94480b6b74ce32cf57e317630f2bafbfbd853eba.

This patch is to address the scenario when one of the paths in the comparison 
has become "stale" as the peer which advertised this has gracefully restarted. 
Stale paths should be considered in best path selection in the same manner as 
non-stale paths, but cannot have the neighbor address compared as the peer may 
not have reconnected and may not have its su_remote information. 

You should be able to identify if this was your scenario based on your 
configuration and/or looking at the "flags" of the route with the NULL 
su_remote, if you are able to.


b) bgpd-fix-route-selection-for-deleted-peer.patch

Ensure that routes from a peer are not considered for best path comparison if 
the peer is not in an Established state. There can be a window between a peer 
being deleted and the background thread that actually clears the routes (marks 
them as "removed") runs during which best path may run. If this path selection 
compared two prefixes all the way down to peer IP addresses and one of these 
two peers had just been deleted, that peer would not have its sockunion 
structures, especially su_remote, resulting in a BGPD exception.

Again, this would only matter to you if your scenario had peer deletion taking 
place ocassionally.


c) bgpd-fix-clearing-completion.patch

This is potentially a more serious issue, though a few things have to align for 
it to be hit. The Clearing_Completed event is meant to be for the entire 
peering and expected to be received when the peer is in Clearing state and upon 
receipt, transition to Idle state. However, the way bgp_clear_route() is 
written, this event could get generated multiple times, per AFI/SAFI. An 
example scenario is if the peer has only negotiated IPv6 unicast, it will/could 
end up receiving Clearing_Completed events for each SAFI of IPv4 address-family 
as nothing got added to the clear-node queue. The first such event will 
transition the peer prematurely to Idle state, while Clearing is still in 
progress (for the IPv6 address-family in the stated example). Subsequent events 
could cause other bad things - e.g., a new connection is Established and then 
the "real" Clearing_Completed event is received after completion of clearing 
for the IPv6 address-family. This would "silently" (i.e., not through bgp_stop
 ()) transition the peer to Idle and subsequent actions could clear 
"su_remote", but there may be prefixes from the peer hanging around from the 
time the connection got setup the 2nd time. This could then cause a crash with 
stack similar to what you mention below.

Thanks,
Vivek


----- Original Message -----
From: "Timo Teras" <[email protected]>
To: "quagga-dev" <[email protected]>
Sent: Monday, December 15, 2014 11:29:59 PM
Subject: [quagga-dev 11909] rare bgp crash

Hi,

I've experience a rare (once in a few weeks to a month) crash with
bgpd. Quagga version 0.99.23.1.

Back trace is as follows:

(gdb) where
#0  0x0000733b9354a130 in sockunion_cmp () from /usr/lib/libzebra.so.0
#1  0x00000f803aabaf1f in bgp_info_cmp ()
#2  0x00000f803aabb5ed in bgp_best_selection ()
#3  0x00000f803aabb955 in bgp_process_main ()
#4  0x0000733b93568eb1 in work_queue_run () from /usr/lib/libzebra.so.0
#5  0x0000733b9354d4c7 in thread_call () from /usr/lib/libzebra.so.0
#6  0x00000f803aaa2b5c in main (argc=6, argv=<optimized out>) at
bgp_main.c:463
 
Unfortunately it seems the build did not honor CFLAGS="-g" fully, so
the debug info is not complete.

Based on disassembly, and register dumps, sockunion_cmp() is called
with su1=0, su2=0xf803ca52500.

bgp_info_cmp() calls sockunion_cmp() only in once place. This would
imply that the previously selected route's peer entry's su_remote is
null.

And the call to bgp_info_cmp() comes from the later code path ("no bgp
deterministic-med").

But clearly the crash is because selected prefix's peer has su_remote
= 0. I believe there are only two code paths when this can happen:
 - bgp_start() is called and clears su_remote due to reconnect request
   (other code paths seem to come from Idle state transitions)
 - there's bgp unbalanced bgp peer unreference, and the peer was
   deleted prematurely

The first option sounds more plausible. As there are certain scenarios
when the box scripts clear certain bgp peers using vtysh. And most of
the other places of bgpd code do seem to check for su_remote != 0
before doing anything with it.

Any suggestions how to fix this? Just add check for su_remote != 0?

Thanks,
Timo

_______________________________________________
Quagga-dev mailing list
[email protected]
https://lists.quagga.net/mailman/listinfo/quagga-dev

_______________________________________________
Quagga-dev mailing list
[email protected]
https://lists.quagga.net/mailman/listinfo/quagga-dev

Reply via email to