We've found our smoking gun.

During heavy bursts of disk IO, bird can block while writing to its log
file. The write is fully blocking, so if the disk is under contention, it
can stall for a long time.

In our lab environment, we've induced bird to block for more than 40
seconds, which is the default dead time in OSPF, by putting the disk under
heavy contention. This applies to the latest stable release (1.4.5). You can
easily reproduce it with a bash one-liner:
"dd if=/dev/zero of=/some/valid/location", where /some/valid/location is a
writable file.
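
To see how long a blocking append can stall while dd is running, a small
probe like the one below may help. This is only a sketch and not BIRD's
logging code: the function name and the probe path are made up for
illustration, and it assumes the probe file lives on the same disk that is
under contention.

```python
import os
import time


def max_append_latency(path, lines=200, interval=0.0):
    """Append `lines` short records to `path`; return the slowest write in seconds."""
    worst = 0.0
    # Plain blocking descriptor (no O_NONBLOCK), roughly how a daemon
    # logging straight to a file would write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i in range(lines):
            record = f"probe line {i}\n".encode()
            start = time.monotonic()
            os.write(fd, record)  # the caller is stuck here until the kernel returns
            worst = max(worst, time.monotonic() - start)
            if interval:
                time.sleep(interval)
    finally:
        os.close(fd)
    return worst
```

For example, calling max_append_latency("/tmp/write-probe.log", lines=600,
interval=0.1) (a hypothetical path) while the dd one-liner is running reports
the worst stall seen over roughly a minute. Anything approaching the OSPF
dead interval would be enough to drop an adjacency.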

We've found the simplest way to work around the issue is to log to syslog
instead of directly to a file. Logging to syslog prevents the blocking
behavior for both bird 1.3.7 and 1.4.5.
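
For anyone who wants to try the same workaround, the change in bird.conf is
roughly the following. Treat it as a sketch: the commented-out file path is
a placeholder, not our actual setting, and your existing log line may list
specific classes instead of "all".

```
# bird.conf: send all log classes to syslog instead of a file
# log "/var/log/bird.log" all;    # placeholder for a previous file-based setting
log syslog all;
```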

I find the behavior surprising: in every other situation that I am aware
of, bird does not block on IO operations. It only uses blocking IO when
writing to the log file.

Whether this is a problem in general or not I leave up to others.

Aleksey: we'll definitely be looking into BFD going forward. Thanks for the
suggestion!



On Tue, Nov 4, 2014 at 3:37 AM, Aleksey Berezin <[email protected]>
wrote:

> Hi.
>
> > In terms of unusual settings, we have some rather aggressive OSPF hello
> > and dead timers set. Hellos are set to 1 a second, and dead set to 3
> > seconds.
>
> In this case it's better to use an OSPF+BFD solution: it's much more
> stable, and you can get sub-second failover.
>
> On Tue, Nov 4, 2014 at 2:52 AM, Alex Laties <[email protected]> wrote:
> > Hey Ondrej,
> >
> > So, we've had debug enabled from the start of our deployment. We
> > currently have "debug protocols all" set.
> >
> > From what I can glean from the logs, these messages appear some time
> > after the adjacency has been established.
> >
> > Our current deployment has juniper routing instances talking OSPF, as
> > well as linux boxes talking OSPF via bird.
> >
> > The OSPF interface is in broadcast mode I believe.
> >
> > In terms of unusual settings, we have some rather aggressive OSPF hello
> > and dead timers set. Hellos are set to 1 a second, and dead set to 3
> > seconds.
> >
> > What we tend to see in our logs from bird is that occasionally, bird
> > fails to send a Hello packet for 2 or 3 seconds. More specifically, we
> > see gaps of 2 to 3 seconds in the log file.
> >
> > We've disabled debug on one of our active nodes. The frequency for that
> > node to go from Full to Down is significantly lower than its peers at
> > the moment (once a day vs once or twice every couple of hours).
> >
> > We're testing bird 1.4.5 on one of our nodes for the next 48 hours and
> > will report back with results there.
> >
> > On Mon, Nov 3, 2014 at 5:35 AM, Ondrej Zajicek <[email protected]>
> > wrote:
> >>
> >> On Mon, Oct 27, 2014 at 09:42:32PM -0400, Alex Laties wrote:
> >> > Hi all,
> >> >
> >> > We currently have a large production deployment using bird version
> >> > 1.3.7 for OSPF.
> >> >
> >> > We're seeing the following message pretty frequently in our logs:
> >> >
> >> > > dbdes - sequence mismatch neighbor 192.168.39.216 (full)
> >> >
> >> > The period between these messages is irregular. Sometimes these occur
> >> > within a few seconds of each other. Sometimes it can be a few hours
> >> > between these messages.
> >>
> >> Hi
> >>
> >> These messages are the result of receiving DBDES packets when a neighbor
> >> adjacency is already established. This shouldn't happen in normal
> >> operation, although i would guess it might happen in some circumstances
> >> if the neighbor is hard restarted and becomes available again before
> >> the local side notices it (by inactivity timer).
> >>
> >>
> >> First, i would suggest using the latest version of BIRD.
> >>
> >> Second, i would suggest enabling 'debug { events }' for OSPF protocol
> >> to see what happens on both sides immediately before the mismatch.
> >>
> >> Do these messages appear just after the neighbor changed state to full
> >> or some time after the adjacency establishment?
> >>
> >> Is the other side also BIRD?
> >>
> >> Is the OSPF interface in broadcast or ptp mode?
> >>
> >> Is this regular or some kind of unusual setting?
> >>
> >>
> >> --
> >> Elen sila lumenn' omentielvo
> >>
> >> Ondrej 'Santiago' Zajicek (email: [email protected])
> >> OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
> >> "To err is human -- to blame it on a computer is even more so."
> >
> >
>
