[quagga-dev 11909] rare bgp crash

Timo Teras Mon, 15 Dec 2014 23:35:35 -0800

Hi,

I've experienced a rare (once every few weeks to a month) crash in
bgpd. Quagga version 0.99.23.1.


Back trace is as follows:

(gdb) where
#0  0x0000733b9354a130 in sockunion_cmp () from /usr/lib/libzebra.so.0
#1  0x00000f803aabaf1f in bgp_info_cmp ()
#2  0x00000f803aabb5ed in bgp_best_selection ()
#3  0x00000f803aabb955 in bgp_process_main ()
#4  0x0000733b93568eb1 in work_queue_run () from /usr/lib/libzebra.so.0
#5  0x0000733b9354d4c7 in thread_call () from /usr/lib/libzebra.so.0
#6  0x00000f803aaa2b5c in main (argc=6, argv=<optimized out>) at bgp_main.c:463

Unfortunately it seems the build did not honor CFLAGS="-g" fully, so
the debug info is not complete.

Based on disassembly, and register dumps, sockunion_cmp() is called
with su1=0, su2=0xf803ca52500.

bgp_info_cmp() calls sockunion_cmp() in only one place. This would
imply that the previously selected route's peer entry's su_remote is
null.

And the call to bgp_info_cmp() comes from the later code path ("no bgp
deterministic-med").

But clearly the crash happens because the selected prefix's peer has
su_remote = 0. I believe there are only two code paths where this can
happen:
 - bgp_start() is called and clears su_remote due to a reconnect request
   (other code paths seem to come from Idle state transitions)
 - there's an unbalanced bgp peer unreference, and the peer was
   deleted prematurely

The first option sounds more plausible, as there are certain scenarios
where the box's scripts clear certain bgp peers using vtysh. Most of
the other places in the bgpd code do seem to check for su_remote != 0
before doing anything with it.

Any suggestions on how to fix this? Just add a check for su_remote != 0?
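
For illustration, a minimal sketch of what such a guarded comparison
could look like. This is not the actual Quagga code: the union
demo_sockunion and both demo_* functions are hypothetical stand-ins,
and the choice to sort a missing (NULL) address before any real address
is an assumption. Which route should win that tie-break is a policy
decision.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical stand-in for a socket-address union; not Quagga's definition. */
union demo_sockunion {
    struct sockaddr sa;
    struct sockaddr_in sin;
    struct sockaddr_in6 sin6;
};

/* Stand-in comparison that dereferences both arguments unconditionally,
 * so it would crash on a NULL pointer just like the call in the backtrace. */
static int demo_sockunion_cmp(const union demo_sockunion *a,
                              const union demo_sockunion *b)
{
    if (a->sa.sa_family != b->sa.sa_family)
        return a->sa.sa_family > b->sa.sa_family ? 1 : -1;

    if (a->sa.sa_family == AF_INET) {
        uint32_t x = ntohl(a->sin.sin_addr.s_addr);
        uint32_t y = ntohl(b->sin.sin_addr.s_addr);
        return (x > y) - (x < y);
    }

    if (a->sa.sa_family == AF_INET6)
        return memcmp(&a->sin6.sin6_addr, &b->sin6.sin6_addr,
                      sizeof(struct in6_addr));

    return 0;
}

/* Guarded wrapper (assumed policy): two NULLs compare equal, and a NULL
 * address sorts before any non-NULL address. */
static int demo_sockunion_cmp_safe(const union demo_sockunion *a,
                                   const union demo_sockunion *b)
{
    if (a == NULL || b == NULL)
        return (a != NULL) - (b != NULL);
    return demo_sockunion_cmp(a, b);
}
```

A guard like this only hides the symptom at the comparison site. If
su_remote is cleared while routes from that peer are still eligible for
best-path selection, the underlying ordering problem remains.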

Thanks,
Timo
