Your logs show segfaults (signal 11), so there should be a way to get core
dumps saved from these. You will definitely need the sysctl
(kern.nosuidcoredump=3) and a pre-created /var/crash/bgpd directory, because
by default cores are disabled for processes which have changed uid/gid (the
RDE and session engine do this). If the parent (root) process also crashes
you should get one in the current directory (from startup time) named
bgpd.core, but this didn't happen in your log.
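As a rough sketch of that setup (the "run" helper and the DOIT variable are
made up for this example, they are not part of any OpenBSD tool), something
like the following prints the two commands by default and only executes
them, as root, when DOIT=1 is set:

```shell
#!/bin/sh
# Sketch only: "run" and DOIT are hypothetical helpers for this example.
# By default each command is printed; set DOIT=1 (as root) to execute it.
run() {
    if [ "${DOIT:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

enable_bgpd_cores() {
    # Allow core dumps from processes that changed uid/gid; with value 3
    # they are written under /var/crash/<program name>/.
    run sysctl kern.nosuidcoredump=3
    # The per-program directory has to exist beforehand.
    run mkdir -p /var/crash/bgpd
}

enable_bgpd_cores
```

After the next crash any core should show up under /var/crash/bgpd. Opening
it with gdb (binary first, then the core file) and typing "bt" at the prompt
should give the backtrace.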
If you don't get anywhere with that, you can attach to the processes with
"gdb /usr/sbin/bgpd $pid" (you will need to do each of the processes
separately). Attaching interrupts the process, so at the gdb prompt quickly
type "c" to continue. Because the segfault happens so soon after startup for
you, you will probably need to modify bgpd.conf to hold the peer back
("passive" might give you enough time) so you can connect gdb, continue,
then enable the peer. Getting coredumps working is much easier than this.
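A minimal sketch of the attach step, assuming pgrep(1) is available to list
the bgpd pids. The attach_cmds function is a made-up helper that only prints
the gdb command for each pid, since each one has to be run in its own
terminal:

```shell
#!/bin/sh
# Sketch only: attach_cmds is a hypothetical helper, not a bgpd or gdb tool.
# It prints one gdb attach command per pid given on the command line.
attach_cmds() {
    for pid in "$@"; do
        echo "gdb /usr/sbin/bgpd $pid"
    done
}

# List the running bgpd processes (parent, session engine, RDE) and print
# the attach command for each; prints nothing if bgpd is not running.
attach_cmds $(pgrep -x bgpd 2>/dev/null)
```

In each gdb session type "c" right after attaching. Once the process
segfaults gdb stops again, and "bt" prints the backtrace.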
Others have seen what they think might be the same crash, but not until much
longer after startup. Whatever conditions are causing it to trigger so
quickly for you might go away as peers change things, so anything you can
grab while it's still happening would be quite useful.
It might be useful to get a packet capture too: "tcpdump -s1500 -i
$interface -w bgp.pcap port 179 and ip6". Send it off list (probably at
least to me and Claudio) if you don't want to make it public.
On 19 September 2018 11:53:14 "Aaron A. Glenn" wrote:
* Claudio Jeker [2018-09-19 09:15]:
On Tue, Sep 18, 2018 at 07:12:39PM +0000, Aaron A. Glenn wrote:
Sep 18 19:06:10 nairobi bgpd[92056]: startup
Sep 18 19:06:10 nairobi bgpd[92056]: rereading config
Sep 18 19:06:10 nairobi bgpd[97583]: route decision engine ready
Sep 18 19:06:10 nairobi bgpd[99160]: session engine ready
Sep 18 19:06:10 nairobi bgpd[99160]: listening on 0.0.0.0
Sep 18 19:06:10 nairobi bgpd[99160]: listening on ::
Sep 18 19:06:10 nairobi bgpd[99160]: SE reconfigured
Sep 18 19:06:10 nairobi bgpd[99160]: neighbor 2xxxx:xxxx::4 (nbo-v6): state change None -> Idle, reason: None
Sep 18 19:06:10 nairobi bgpd[99160]: neighbor 2xxxx:xxxx::4 (nbo-v6): state change Idle -> Connect, reason: Start
Sep 18 19:06:10 nairobi bgpd[97583]: change to/from route-collector mode ignored
Sep 18 19:06:10 nairobi bgpd[97583]: RDE reconfigured
Sep 18 19:06:10 nairobi bgpd[97583]: running softreconfig in
Sep 18 19:06:10 nairobi bgpd[99160]: neighbor 2xxxx:xxxx::4 (nbo-v6): state change Connect -> OpenSent, reason: Connection opened
Sep 18 19:06:10 nairobi bgpd[97583]: RDE soft reconfiguration done
Sep 18 19:06:10 nairobi bgpd[99160]: neighbor 2xxxx:xxxx::4 (nbo-v6): state change OpenSent -> OpenConfirm, reason: OPEN message received
Sep 18 19:06:10 nairobi bgpd[99160]: neighbor 2xxxx:xxxx::4 (nbo-v6): state change OpenConfirm -> Established, reason: KEEPALIVE message received
Sep 18 19:06:10 nairobi bgpd[97583]: neighbor 2xxxx:xxxx::4 (nbo-v6): sending IPv6 unicast EOR marker
Sep 18 19:06:12 nairobi bgpd[97583]: neighbor 2xxxx:xxxx::4 (nbo-v6): bad ASPATH, path invalidated and prefix withdrawn
Sep 18 19:06:12 nairobi last message repeated 157 times
Sep 18 19:06:12 nairobi bgpd[92056]: peer closed imsg connection
Sep 18 19:06:12 nairobi bgpd[99160]: peer closed imsg connection
Sep 18 19:06:12 nairobi bgpd[99160]: SE: Lost connection to RDE
Sep 18 19:06:12 nairobi bgpd[99160]: peer closed imsg connection
Sep 18 19:06:12 nairobi bgpd[99160]: SE: Lost connection to RDE control
Sep 18 19:06:12 nairobi bgpd[92056]: main: Lost connection to RDE
Sep 18 19:06:12 nairobi bgpd[92056]: session engine terminated; signal 11
Sep 18 19:06:12 nairobi bgpd[92056]: route decision engine terminated; signal 11
Sep 18 19:06:12 nairobi bgpd[92056]: terminating
nairobi# sysctl kern.version
kern.version=OpenBSD 6.4-beta (GENERIC.MP) #301: Tue Sep 18 08:25:16 MDT 2018
[email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
Can you try a later snapshot? 9 Sept was right in the middle of n2k18,
and it could be that this is just an unlucky snapshot with something that
is already fixed. If that does not help, Stuart Henderson's mail has good
instructions.
I initially experienced the behavior with a 9 Sept snapshot, and immediately
re-deployed to the latest snapshot (18 Sept). Same result (bad ASPATH might be
new, but don't quote me on that).
If you can get a backtrace for the RDE process crashing, that would be
very helpful for me. One way of doing that is to attach gdb to the running
process; another is to get a core dump with kern.nosuidcoredump=3 set (and
mkdir /var/crash/bgpd).
I'm unsure how to get a backtrace, short of finding the right breakpoint to
set. Using all of my gdb knowledge, I set `follow-fork-mode child` and `catch
fork`, but that wasn't very fruitful (especially w/o symbols).
I will find time to follow sthen's (wonderfully complete) instructions and/or
make another attempt at getting a useful gdb backtrace after EuroBSDcon,
provided the latest snapshot then exhibits similar behavior.
Since it doesn't (seem to) coredump, I was unable to coax anything into
/var/crash/bgpd.