Hi,

so I have been looking into this as well and produced a flamegraph of babeld-1.9.2 running for about 10 minutes. This instance was connected to one peer which is part of the same network Fabian described. The peer was restarted once to force some updates.

I uploaded the interactive flamegraph here [1] (you may need to download it and open in a browser to be able to zoom in)

To my eye, conflict_solution() [2] really stands out. And just glancing at the code it roughly looks like every route is compared to every route, which to me sounds like quadratic complexity. My guess is that it's even worse, close to cubic complexity, since this function is called on route installs in [3].

git-blame shows that the code is more than 5 years old. So if this is indeed our problem, it's not new, we just grew into it.

Cheers,

Johannes

[1] https://gist.githubusercontent.com/lemmi/901724d6bc658e0a5a21d8d9fede18d1/raw/f7e1006f4a915a06ab1d88b4295fee7c7e07a28c/babel.svg

[2] https://github.com/jech/babeld/blob/d42c1dbdfd6dc836c7b0c8e32657d0be5f0b2b84/disambiguation.c#L170

[3] https://github.com/jech/babeld/blob/d42c1dbdfd6dc836c7b0c8e32657d0be5f0b2b84/disambiguation.c#L323

On 31.05.20 13:13, Fabian Bläse wrote:
Hi,

sorry for the late reply.

On 15.05.20 01:18, Juliusz Chroboczek wrote:
Ever since we have upgraded to babeld 1.9.x, CPU usage is a lot higher
than with 1.8.x. Especially slower devices like embedded MIPS routers
are having trouble to keep up, which leads to route instablity due to
late helos.
That sounds pretty bad.

In principle, 1.9 should use less CPU than 1.8, since some algorithms have
been reworked to be in n·logn instead of n².  On the other hand, 1.9 uses
more memory, since there is now a per-neighbour unicast buffer instead of
a single buffer for all neighbours; this shouldn't matter much in practice,
except if you have hundreds of neighbours.
I've tried analysing this in more detail using the debug output.
It looks like the actual route algorithms are not the problem, but the 
communication with netlink.

If the network gets unstable somewhere upstream, a lot of unreachable routes 
are sent through the downstream network (for all the routes from the upstream 
network, that were lost now).
However, the netlink interface seems to be relatively slow, especially on the 
device we are having a lot of trouble with (TP Link WDR 3600, Atheros AR9344, 
74Kc MIPS 560 MHz).
Installing all these unreachable routes takes so long, that the relatively 
small socket buffer for babel messages overflows, because it is not read while 
route updates are sent to netlink. That leads to loss babel messages.

That initiates a state babeld is unlikely to recover from, because changing the 
state of all the routes in the kernel (reachable, unreachable) always takes so 
long, that new babel messages are lost.

The issue probably can only be fixed if route updates are not sent to netlink 
synchronously.

I'm not really shure, why this only occurs with babeld >=1.9.0.
I looks like I got a little confused with version numbers, so I might have 
tested with versions that still had the IPv4 xroute issue [1].

Perhaps you could provide us with a CPU profile?
I don't really know, what you mean.

Regards,
Fabian

[1] https://github.com/jech/babeld/issues/46


_______________________________________________
Babel-users mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users
_______________________________________________
Babel-users mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users

Reply via email to