Re: unstable BGP sessions after debian 10 -> 11 upgrade

2021-10-24 Thread Alexander Shevchenko
Hi.
We faced similar issues under memory pressure: an application leaked memory
and caused intense swapping. If you have sysstat installed, you can check
for this with sar -B and sar -W.
WBR,
Alexander Shevchenko
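
If sysstat is not installed, the swap check suggested above can be
approximated from /proc/vmstat. This is a minimal sketch, assuming a Linux
host; the function names (parse_vmstat, swap_rates, sample) are illustrative
and not part of any tool mentioned in the thread. It samples the pswpin,
pswpout and pgmajfault counters twice and reports per-second rates.

```python
import time


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Per-second deltas of the swap-in, swap-out and major-fault counters."""
    keys = ("pswpin", "pswpout", "pgmajfault")
    return {k: (after.get(k, 0) - before.get(k, 0)) / seconds for k in keys}


def sample(path="/proc/vmstat", seconds=5.0):
    """Read the counters twice, `seconds` apart, and return the rates."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_rates(before, after, seconds)
```

Calling sample() on the router while the sessions are flapping gives a rough
picture: sustained non-zero pswpin/pswpout rates would point towards the
memory-pressure scenario described above.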


On Sun, 24 Oct 2021 at 17:26, Christoph wrote:

> Hello,
>
> we upgraded our debian BGP routers from debian buster to bullseye.
>
> On debian 10 they used the repo https://bird.network.cz/debian/
> now they use the bird2 package directly from the official debian repos,
> both repos contain BIRD version 2.0.7.
>
>
> After upgrading and rebooting we noticed that iBGP sessions are
> constantly dying every few minutes.
>
>
> The logs also show several of these entries:
>
>  Kernel dropped some netlink messages, will resync on next scan.
>  Event ... took 31415 ms
>  I/O loop cycle took 31529 ms for 9 events
>  Netlink: File exists
>
>
> In particular, the warning "Kernel dropped some netlink messages" is not
> new; we previously solved it with these sysctl.conf settings:
>
>
> # solves the "Kernel dropped some netlink messages, will resync on next scan." BIRD warnings
> # https://bird.network.cz/pipermail/bird-users/2017-September/011542.html
> # (we did not change the wmem_default since this is apparently not needed)
> net.core.rmem_max=4194304
> net.core.rmem_default=4194304
> net.core.wmem_max=4194304
>
>
> The most obvious change is the linux kernel version:
>   4.19.208 (buster)
>   5.10.70 (bullseye)
>
> To avoid constantly dying iBGP sessions we increased the keepalive and
> hold timers.
> Are you observing similar issues on your debian 11 systems?
> Does the newer kernel need higher rmem values?
>
> best regards,
> Christoph
>
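
One thing that may be worth ruling out after a distribution upgrade is
whether the sysctl values quoted above are still in effect on the running
kernel. The following is a minimal sketch, assuming a Linux host where the
keys map to files under /proc/sys; the helper names (sysctl_path, check) are
illustrative. It does not claim that missing settings are the cause of the
problem.

```python
from pathlib import Path

# Values taken from the sysctl.conf snippet quoted in the message above.
EXPECTED = {
    "net.core.rmem_max": 4194304,
    "net.core.rmem_default": 4194304,
    "net.core.wmem_max": 4194304,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")


def check(expected=EXPECTED, root="/proc/sys"):
    """Return {key: (expected, actual)} for keys that differ.

    `actual` is None when the file is missing or cannot be parsed.
    """
    mismatches = {}
    for key, want in expected.items():
        try:
            have = int(sysctl_path(key, root).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            have = None
        if have != want:
            mismatches[key] = (want, have)
    return mismatches
```

An empty result from check() means the three values match the running
kernel; anything else lists the expected and the actual value per key.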


Re: BFD and Juniper SRX inter-op issue

2021-09-02 Thread Alexander Shevchenko
 SrcAddr (5) len 8: 10.10.255.1
Jun  9 15:56:14 HoldTime (14) len 8: 10 sec 0 nsec
Jun  9 15:56:14 NoAbsorb (15) len 1: True
Jun  9 15:56:14 NoRefresh (16) len 1: True
Jun  9 15:56:14 ForceRefresh (17) len 1: False
Jun  9 15:56:14 DoNotAge (18) len 1: True
Jun  9 15:56:14 Distribute (27) len 1: True
Jun  9 15:56:14 LooseAuth (122) len 1: (hex) 00
Jun  9 15:56:14 Discriminator (63) len 4: 0x1e
Jun  9 15:56:14 DestAddr (8) len 8: 10.10.255.0
Jun  9 15:56:14 RtblIdx (24) len 4: 10
Jun  9 15:56:14 MinRecvTTL (68) len 1: 255
Jun  9 15:56:14 RecvOnMhopPort (101) len 1: 0
Jun  9 15:56:14 Unknown (153) len 1: (hex) 00
Jun  9 15:56:14 Unknown (154) len 4: (hex) 00 00 00 00
Jun  9 15:56:14 Unknown (165) len 4: (hex) 00 00 00 03
Jun  9 15:56:14 Unknown (211) len 1: (hex) 04
Jun  9 15:56:14 Unknown (167) len 1: (hex) 01
Jun  9 15:56:14 (bfdd_build_packet:2261) : Session 10.10.255.1 (IFL 568): cur tx ivl 100

WBR,
Alexander Shevchenko

On Wed, Sep 1, 2021 at 5:03 PM Justin Cattle  wrote:

> Hi,
>
>
> Unfortunately not.  I hope we will raise a bug with Juniper, but it could
> take a while to get any resolution.
>
> It would also be interesting to know if there is something more Bird
> could/should be doing in this case - I hope for some developer feedback on
> the issue :)
>
>
> Cheers,
> Just
>
>
> On Mon, 23 Aug 2021 at 12:21, Oliver  wrote:
>
>> Hi Just,
>>
>> did you make any progress on this? We have the same problem with Deutsche
>> Telekom as our upstream provider. They also use Juniper routers.
>>
>> Best regards,
>>
>> Oliver
>>
>> On Tue, 10 Aug 2021, Justin Cattle wrote:
>>
>> > Forgot to mention, in the bird logs I see lots of messages such as this:
>> >
>> >  bfd1: Bad packet from 1.1.11.2 - unknown session id (0123456789)
>> >
>> >
>> > Cheers,
>> > Just
>> >
>> >
>> > On Tue, 10 Aug 2021 at 13:20, Justin Cattle  wrote:
>> >
>> > > Hi,
>> > >
>> > >
>> > > I have encountered what seems to be a bug of sorts in the Juniper
>> > > implementation of BFD in at least their SRX340.
>> > >
>> > > We have no issues with the QFX series, where BFD seems to work as
>> > > expected with bird.
>> > >
>> > > I'm wondering if there is anything we can do to handle this issue on
>> > > the bird side, or if anyone has any insight that may shed some light
>> > > on the behaviour we are seeing.
>> > >
>> > > Here is the issue summary:
>> > >
>> > >    - BFD timers are set quite conservatively:
>> > >       - interval 4000 ms
>> > >       - multiplier 6
>> > >
>> > >    - A BFD session between a bird endpoint and a Juniper endpoint is
>> > >      up and running at the start - all fine.
>> > >    - If you stop bird on the server, then after the Detection time
>> > >      [ currently 24 secs ] the BFD messages from the Juniper show
>> > >      status Down with the diagnostic message Control Detection Time
>> > >      Expired.  You can then start bird on the server again, and the
>> > >      two sides will agree session info and BFD status goes Up.  -
>> > >      This is expected.
>> > >    - However, if you stop bird but start it again before the
>> > >      Detection time [ currently 24 secs ], as for a service restart,
>> > >      the BFD messages from the Juniper never show Down, the two
>> > >      sides never agree on a BFD session, and BFD remains Down on the
>> > >      server but Up on the Juniper.  - Should a new session be
>> > >      established at this point ?
>> > >    - Once the Juniper gets stuck in the BFD status Up state, you can
>> > >      stop bird for a long time [ over an hour at least ], and the
>> > >      Juniper never seems to notice [ the BFD packets still show
>> > >      state Up ].  - This seems to be a bug on the Juniper end - why
>> > >      should it never go Down in this state ?
>> > >    - If the BFD session info is reset on the Juniper side, then the
>> > >      two sides will agree session info and BFD status goes Up.
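
The 24-second figure in the summary above follows from the timers: a 4000 ms
interval times a multiplier of 6. The following is a minimal sketch of the
asynchronous-mode Detection Time calculation as I understand it from RFC 5880
(section 6.8.4), assuming both sides use the same 4000 ms interval; the
function and parameter names are illustrative, not bird or Junos identifiers.

```python
def detection_time_ms(remote_detect_mult,
                      local_required_min_rx_ms,
                      remote_desired_min_tx_ms):
    """Detection Time on the local system in asynchronous mode.

    The negotiated receive interval is the larger of what we are willing
    to receive and what the peer wants to send; the peer's multiplier is
    then applied to it.
    """
    agreed_interval = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval


def peer_should_time_out(outage_ms, detection_ms):
    """Rough check: an outage at least as long as the Detection Time
    should be noticed by the peer (ignores jitter and packet timing)."""
    return outage_ms >= detection_ms


# Timers from the thread: interval 4000 ms, multiplier 6.
DETECTION_MS = detection_time_ms(6, 4000, 4000)
```

With these values DETECTION_MS is 24000, so a quick service restart of a few
seconds falls well inside the window in which the Juniper is not expected to
time the session out on its own.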
>> > >
>> > >
>> > > Does anyone have any thoughts ?
>> > >
>> > > Is there a packet bird can send, gratuitous or not, that can make the
>> > > juniper end realise it MUST reinitialize ?
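
For reference, here is a minimal sketch of the session state transitions as
I read them in RFC 5880 (section 6.8.6). It covers only the State field; it
leaves out discriminator matching, timers and authentication, and the names
(State, next_state) are illustrative. Under this reading, a peer whose
session is Up and which accepts a control packet with state Down should move
to Down, so the Down packets a freshly restarted bird sends would already be
the trigger to reinitialize, provided the Juniper matches them to its
existing session. Whether the SRX does so is the open question in this
thread.

```python
from enum import Enum


class State(Enum):
    # Numeric values follow the State field encoding in RFC 5880.
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3


def next_state(local, received):
    """Return the local session state after accepting a control packet
    whose State field is `received`."""
    if local is State.ADMIN_DOWN:
        return local  # packets are discarded while administratively down
    if received is State.ADMIN_DOWN:
        return State.DOWN
    if local is State.DOWN:
        if received is State.DOWN:
            return State.INIT
        if received is State.INIT:
            return State.UP
        return local
    if local is State.INIT:
        if received in (State.INIT, State.UP):
            return State.UP
        return local
    # local is UP: only a Down from the peer tears the session down
    if received is State.DOWN:
        return State.DOWN
    return local
```

For example, next_state(State.UP, State.DOWN) returns State.DOWN, which is
the transition the Juniper side never seems to make in the scenario
described above.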
>> > &