Good to hear this may possibly resolve your issue. I shall post the other 2 patches real soon.
Thanks, Vivek ----- Original Message ----- From: "Timo Teras" <[email protected]> To: "Vivek Venkatraman" <[email protected]> Cc: "quagga-dev" <[email protected]> Sent: Wednesday, December 17, 2014 3:24:55 AM Subject: Re: [quagga-dev 11909] rare bgp crash Hi Vivek, Thanks for the info. I believe the stale flag was set, so sounds like the patch you posted should fix it. I'll apply that and see how it goes. I'd happy to look at the other two patches if possible. Thanks, Timo On Tue, 16 Dec 2014 11:58:35 -0800 (PST) Vivek Venkatraman <[email protected]> wrote: > Hi Timo, > > We (Cumulus) have seen a very similar crash (with same stack trace) > in a couple of different situations. For one scenario, we have > already submitted a patch which is up on the Cumulus quagga github > tree and will hopefully be rolled in soon. For the second scenario, > we have a patch that is currently in test and should be pushed by us > soon. In addition, there is a third scenario that we uncovered > recently that may potentially lead to this situation too, though we > haven't actually seen this manifestation. A fix for this is also in > test and should be part of the next set of patches that we push. > > None of these 3 patches has anything to do with any other Cumulus > patches. > > a) bgpd-gr-route-selection-fix.patch > > This is the patch that is up on our tree and can be viewed at > https://github.com/CumulusNetworks/quagga/commit/94480b6b74ce32cf57e317630f2bafbfbd853eba. > > This patch is to address the scenario when one of the paths in the > comparison has become "stale" as the peer which advertised this has > gracefully restarted. Stale paths should be considered in best path > selection in the same manner as non-stale paths, but cannot have the > neighbor address compared as the peer may not have reconnected and > may not have its su_remote information. > > You should be able to identify if this was your scenario based on > your configuration and/or looking at the "flags" of the route with > the NULL su_remote, if you are able to. > > > b) bgpd-fix-route-selection-for-deleted-peer.patch > > Ensure that routes from a peer are not considered for best path > comparison if the peer is not in an Established state. There can be a > window between a peer being deleted and the background thread that > actually clears the routes (marks them as "removed") runs during > which best path may run. If this path selection compared two prefixes > all the way down to peer IP addresses and one of these two peers had > just been deleted, that peer would not have its sockunion structures, > especially su_remote, resulting in a BGPD exception. > > Again, this would only matter to you if your scenario had peer > deletion taking place ocassionally. > > > c) bgpd-fix-clearing-completion.patch > > This is potentially a more serious issue, though a few things have to > align for it to be hit. The Clearing_Completed event is meant to be > for the entire peering and expected to be received when the peer is > in Clearing state and upon receipt, transition to Idle state. > However, the way bgp_clear_route() is written, this event could get > generated multiple times, per AFI/SAFI. An example scenario is if the > peer has only negotiated IPv6 unicast, it will/could end up receiving > Clearing_Completed events for each SAFI of IPv4 address-family as > nothing got added to the clear-node queue. The first such event will > transition the peer prematurely to Idle state, while Clearing is > still in progress (for the IPv6 address-family in the stated > example). Subsequent events could cause other bad things - e.g., a > new connection is Established and then the "real" Clearing_Completed > event is received after completion of clearing for the IPv6 > address-family. This would "silently" (i.e., not through bgp_stop()) > transition the peer to Idle and subsequent actions could clear > "su_remote", but there may be prefixes from the peer hanging around > from the time the connection got setup the 2nd time. This could then > cause a crash with stack similar to what you mention below. > > Thanks, > Vivek > > > ----- Original Message ----- > From: "Timo Teras" <[email protected]> > To: "quagga-dev" <[email protected]> > Sent: Monday, December 15, 2014 11:29:59 PM > Subject: [quagga-dev 11909] rare bgp crash > > Hi, > > I've experience a rare (once in a few weeks to a month) crash with > bgpd. Quagga version 0.99.23.1. > > Back trace is as follows: > > (gdb) where > #0 0x0000733b9354a130 in sockunion_cmp () from /usr/lib/libzebra.so.0 > #1 0x00000f803aabaf1f in bgp_info_cmp () > #2 0x00000f803aabb5ed in bgp_best_selection () > #3 0x00000f803aabb955 in bgp_process_main () > #4 0x0000733b93568eb1 in work_queue_run () > from /usr/lib/libzebra.so.0 #5 0x0000733b9354d4c7 in thread_call () > from /usr/lib/libzebra.so.0 #6 0x00000f803aaa2b5c in main (argc=6, > argv=<optimized out>) at bgp_main.c:463 > > Unfortunately it seems the build did not honor CFLAGS="-g" fully, so > the debug info is not complete. > > Based on disassembly, and register dumps, sockunion_cmp() is called > with su1=0, su2=0xf803ca52500. > > bgp_info_cmp() calls sockunion_cmp() only in once place. This would > imply that the previously selected route's peer entry's su_remote is > null. > > And the call to bgp_info_cmp() comes from the later code path ("no bgp > deterministic-med"). > > But clearly the crash is because selected prefix's peer has su_remote > = 0. I believe there are only two code paths when this can happen: > - bgp_start() is called and clears su_remote due to reconnect request > (other code paths seem to come from Idle state transitions) > - there's bgp unbalanced bgp peer unreference, and the peer was > deleted prematurely > > The first option sounds more plausible. As there are certain scenarios > when the box scripts clear certain bgp peers using vtysh. And most of > the other places of bgpd code do seem to check for su_remote != 0 > before doing anything with it. > > Any suggestions how to fix this? Just add check for su_remote != 0? > > Thanks, > Timo > > _______________________________________________ > Quagga-dev mailing list > [email protected] > https://lists.quagga.net/mailman/listinfo/quagga-dev _______________________________________________ Quagga-dev mailing list [email protected] https://lists.quagga.net/mailman/listinfo/quagga-dev
