Hi,

Unfortunately just having bgpd-gr-route-selection-fix.patch does not
fully fix the issue. It seems that su_remote can be null when
BGP_INFO_STALE is not set.

The peer would probably be in Connected state, so depending on how
bgpd-fix-route-selection-for-deleted-peer.patch is done, that would
probably fix it. I was not able to find this patch anywhere. Would it
be possible to post it?

Otherwise, I'll probably add explicit su_remote null checks; something
like:

diff --git a/bgpd/bgp_route.c b/bgpd/bgp_route.c
index 34ba1ab..ad8642e 100644
--- a/bgpd/bgp_route.c
+++ b/bgpd/bgp_route.c
@@ -348,9 +348,9 @@ bgp_info_cmp (struct bgp *bgp, struct bgp_info *new, struct 
bgp_info *exist,
   *paths_eq = 0;
 
   /* 0. Null check. */
-  if (new == NULL)
+  if (new == NULL || new->peer->su_remote == NULL)
     return 0;
-  if (exist == NULL)
+  if (exist == NULL || exists->peer->su_remote == NULL)
     return 1;
 
   newattr = new->attr;

Thanks,
Timo


On Wed, 17 Dec 2014 13:24:55 +0200
Timo Teras <[email protected]> wrote:

> Hi Vivek,
> 
> Thanks for the info. I believe the stale flag was set, so sounds like
> the patch you posted should fix it. I'll apply that and see how it
> goes.
> 
> I'd happy to look at the other two patches if possible.
> 
> Thanks,
> Timo
> 
> On Tue, 16 Dec 2014 11:58:35 -0800 (PST)
> Vivek Venkatraman <[email protected]> wrote:
> 
> > Hi Timo,
> > 
> > We (Cumulus) have seen a very similar crash (with same stack trace)
> > in a couple of different situations. For one scenario, we have
> > already submitted a patch which is up on the Cumulus quagga github
> > tree and will hopefully be rolled in soon. For the second scenario,
> > we have a patch that is currently in test and should be pushed by us
> > soon. In addition, there is a third scenario that we uncovered
> > recently that may potentially lead to this situation too, though we
> > haven't actually seen this manifestation. A fix for this is also in
> > test and should be part of the next set of patches that we push.
> > 
> > None of these 3 patches has anything to do with any other Cumulus
> > patches.
> > 
> > a) bgpd-gr-route-selection-fix.patch
> > 
> > This is the patch that is up on our tree and can be viewed at
> > https://github.com/CumulusNetworks/quagga/commit/94480b6b74ce32cf57e317630f2bafbfbd853eba.
> > 
> > This patch is to address the scenario when one of the paths in the
> > comparison has become "stale" as the peer which advertised this has
> > gracefully restarted. Stale paths should be considered in best path
> > selection in the same manner as non-stale paths, but cannot have the
> > neighbor address compared as the peer may not have reconnected and
> > may not have its su_remote information. 
> > 
> > You should be able to identify if this was your scenario based on
> > your configuration and/or looking at the "flags" of the route with
> > the NULL su_remote, if you are able to.
> > 
> > 
> > b) bgpd-fix-route-selection-for-deleted-peer.patch
> > 
> > Ensure that routes from a peer are not considered for best path
> > comparison if the peer is not in an Established state. There can be
> > a window between a peer being deleted and the background thread that
> > actually clears the routes (marks them as "removed") runs during
> > which best path may run. If this path selection compared two
> > prefixes all the way down to peer IP addresses and one of these two
> > peers had just been deleted, that peer would not have its sockunion
> > structures, especially su_remote, resulting in a BGPD exception.
> > 
> > Again, this would only matter to you if your scenario had peer
> > deletion taking place ocassionally.
> > 
> > 
> > c) bgpd-fix-clearing-completion.patch
> > 
> > This is potentially a more serious issue, though a few things have
> > to align for it to be hit. The Clearing_Completed event is meant to
> > be for the entire peering and expected to be received when the peer
> > is in Clearing state and upon receipt, transition to Idle state.
> > However, the way bgp_clear_route() is written, this event could get
> > generated multiple times, per AFI/SAFI. An example scenario is if
> > the peer has only negotiated IPv6 unicast, it will/could end up
> > receiving Clearing_Completed events for each SAFI of IPv4
> > address-family as nothing got added to the clear-node queue. The
> > first such event will transition the peer prematurely to Idle
> > state, while Clearing is still in progress (for the IPv6
> > address-family in the stated example). Subsequent events could
> > cause other bad things - e.g., a new connection is Established and
> > then the "real" Clearing_Completed event is received after
> > completion of clearing for the IPv6 address-family. This would
> > "silently" (i.e., not through bgp_stop()) transition the peer to
> > Idle and subsequent actions could clear "su_remote", but there may
> > be prefixes from the peer hanging around from the time the
> > connection got setup the 2nd time. This could then cause a crash
> > with stack similar to what you mention below.
> > 
> > Thanks,
> > Vivek
> > 
> > 
> > ----- Original Message -----
> > From: "Timo Teras" <[email protected]>
> > To: "quagga-dev" <[email protected]>
> > Sent: Monday, December 15, 2014 11:29:59 PM
> > Subject: [quagga-dev 11909] rare bgp crash
> > 
> > Hi,
> > 
> > I've experience a rare (once in a few weeks to a month) crash with
> > bgpd. Quagga version 0.99.23.1.
> > 
> > Back trace is as follows:
> > 
> > (gdb) where
> > #0  0x0000733b9354a130 in sockunion_cmp ()
> > from /usr/lib/libzebra.so.0 #1  0x00000f803aabaf1f in bgp_info_cmp
> > () #2  0x00000f803aabb5ed in bgp_best_selection ()
> > #3  0x00000f803aabb955 in bgp_process_main ()
> > #4  0x0000733b93568eb1 in work_queue_run ()
> > from /usr/lib/libzebra.so.0 #5  0x0000733b9354d4c7 in thread_call ()
> > from /usr/lib/libzebra.so.0 #6  0x00000f803aaa2b5c in main (argc=6,
> > argv=<optimized out>) at bgp_main.c:463
> >  
> > Unfortunately it seems the build did not honor CFLAGS="-g" fully, so
> > the debug info is not complete.
> > 
> > Based on disassembly, and register dumps, sockunion_cmp() is called
> > with su1=0, su2=0xf803ca52500.
> > 
> > bgp_info_cmp() calls sockunion_cmp() only in once place. This would
> > imply that the previously selected route's peer entry's su_remote is
> > null.
> > 
> > And the call to bgp_info_cmp() comes from the later code path ("no
> > bgp deterministic-med").
> > 
> > But clearly the crash is because selected prefix's peer has
> > su_remote = 0. I believe there are only two code paths when this
> > can happen:
> >  - bgp_start() is called and clears su_remote due to reconnect
> > request (other code paths seem to come from Idle state transitions)
> >  - there's bgp unbalanced bgp peer unreference, and the peer was
> >    deleted prematurely
> > 
> > The first option sounds more plausible. As there are certain
> > scenarios when the box scripts clear certain bgp peers using vtysh.
> > And most of the other places of bgpd code do seem to check for
> > su_remote != 0 before doing anything with it.
> > 
> > Any suggestions how to fix this? Just add check for su_remote != 0?
> > 
> > Thanks,
> > Timo
> > 
> > _______________________________________________
> > Quagga-dev mailing list
> > [email protected]
> > https://lists.quagga.net/mailman/listinfo/quagga-dev
> 


_______________________________________________
Quagga-dev mailing list
[email protected]
https://lists.quagga.net/mailman/listinfo/quagga-dev

Reply via email to