On 2012-05-29, Matt Hamilton <ma...@netsight.co.uk> wrote:
> Otto Moerbeek <otto <at> drijf.net> writes:
>
>> 
>> On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
>> 
>> > Hi all,
>> > 
>> > More bgpd problems last night :( This happened last night on two of our
>> > routers. One running an old version of OpenBSD (4.3) and one running
>> > 5.1. Is there anyone out there actually using bgpd in production? How
>> > do you deal with it quitting every time something unexpected happens on
>> > the network?
>> 
>> Yes, lots of people run it in production. 
>
> That is what I'd expect. I just don't understand how, with it dropping
> out whenever it has some transient problem.
>
>> > 
>> > The first message below seems to indicate unable to allocate
>> > memory. I'm running these boxes pretty much stock having not tuned any
>> > parameters at all. Both are just running routing daemons (bgpd, ospf)
>> > and the 4.3 box is running OpenVPN. There are no applications running
>> > and both boxes have plenty of RAM (4GB) and not using any swap or
>> > anything.
>> > 
>> > Is there something I should look at tuning in terms
>> > of memory allocation in order to stop this happening?
>> > 
>> > OpenBSD 4.3/amd64:
>> > 
>> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
>> > allocate memory
>> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
>> > error: Cannot allocate memory
>> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
>> > engine exited
>> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
>> > Broken pipe
>> 
>> Only solution: upgrading. You are running unsupported software, a
>> foolish thing to do.
>
> Alas we don't all live in Utopia ;) This box is due to be upgraded soon, 
> but that upgrade is predicated on getting a stable routing environment
> so that I can do so. At the moment we are mid-way through migrating
> away from Cisco kit to OpenBSD routers. Until I can be confident that it
> won't all just fall over I can't continue with the migration.

I would *not* want to be running ospfd from before 5.1 on a DFZ
router. First, RTM_DESYNC (route socket overflows) was not dealt with
at all in ospfd until 4.8. From then until 5.1, overflows tended to
result in lots of kernel route table dumps in quick succession to
get back into sync, which is pretty hard on the machine. In 5.1
a holdoff timer was introduced for these resyncs.

bgpd-wise, since 4.3 there have been fixes for crashes triggered by
bad updates (these affected most BGP implementations, not just
OpenBSD) and numerous other fixes. If you are upgrading from that
version, use bsd.rd to upgrade rather than untarring sets on the live
system, and read the upgrade notes for the intermediate versions; I
think that time period includes slightly incompatible changes to
bgpd.conf.

> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard
> to believe that this behaviour would have been tolerated by people 
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt
>
>

Not sure when I started seeing it as I had various other problems
on the network and with hardware back in the 4.3 days (what's that,
4 years ago or so?)

Some people don't seem to hit it at all. One of the most common
uses of OpenBGPD is running as a route server with mostly LAN-based
connections, and I suspect this type of setup is less likely to hit
this problem. I usually only hit it on routers connected via WAN
links (redundant paths with OSPF which flap on occasion). I usually
hit the memory problem a few times in fairly quick succession,
then not again for sometimes as much as a couple of months or
even longer.

Without having had a way to trigger it in the lab, and in my case
without much storage on the routers to save dumps, getting more
information to help track it down is challenging. When it happens I am
reliant on out-of-band access, need to get the network back up, and am
often not fully awake, having been woken by a text from icinga, so
debug opportunities are very limited.

If you're in a better position to get some debug information: from what
we've worked out more recently, I would suggest trying to flap the OSPF
links, as that possibly triggers it.
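
For anyone wanting to line these crashes up against link flaps, here is
a rough sketch of a log scanner. It is not something from the bgpd
tree; the function names (bgpd_failures, group_bursts), the keyword
list, and the 10-minute burst window are all made up for illustration,
and it assumes each syslog message sits on one line (unlike the wrapped
excerpt quoted above). It pulls bgpd failure lines out of a syslog file
and groups them into bursts, which should make the "few times in quick
succession, then quiet for months" pattern easy to see.

```python
import re
from datetime import datetime, timedelta

# One syslog line per message, e.g.
#   May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose error: Cannot allocate memory
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) bgpd\[\d+\]: (?P<msg>.*)$"
)

# Substrings taken from the log excerpt quoted in this thread (assumed
# to be representative; extend as needed).
KEYWORDS = ("Cannot allocate memory", "fatal in", "Lost child")


def bgpd_failures(lines, year):
    """Return (timestamp, host, message) tuples for bgpd failure lines.

    Classic syslog timestamps carry no year, so the caller supplies one.
    The result is ordered by timestamp; lines logged in the same second
    keep their original order.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        msg = m.group("msg")
        if not any(k in msg for k in KEYWORDS):
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        events.append((ts, m.group("host"), msg))
    events.sort(key=lambda e: e[0])
    return events


def group_bursts(events, gap=timedelta(minutes=10)):
    """Split time-ordered events into bursts separated by more than `gap`."""
    bursts = []
    for event in events:
        if bursts and event[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(event)
        else:
            bursts.append([event])
    return bursts
```

To use it, open the log file that receives bgpd's messages (probably
/var/log/daemon on a stock install, but check your syslog.conf), pass
the file object and the year to bgpd_failures(), and hand the result to
group_bursts(). Comparing the start time of each burst with the ospfd
neighbor state changes in the same log should show whether the two
coincide.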
