Re: Linux IPv6 routing strange behaviour
On Thu, Aug 15, 2013 at 11:35:50AM +0100, Phil Mayers wrote:
> On 15/08/13 11:31, Hannes Frederic Sowa wrote:
>> On Thu, Aug 15, 2013 at 07:39:23AM +0200, Mikael Abrahamsson wrote:
>>> On Wed, 14 Aug 2013, Max Tulyev wrote:
>>>> What is the solution? There are *MILLIONS* of flows in the backbone...
>>>
>>> The solution is not to use a flow routing platform in the core. This lesson was learnt at the end of the '90s. So until the Linux IPv6 forwarding code is fixed to do stateless forwarding, it's just not suited for your application.
>>
>> Some time ago I started working on nh-exceptions, but it is a very delicate change. I hope I can look at this again as soon as I have some more free time. Because the data structures are already in place for IPv4 in the generic routing code, it should not be such a big patch.
>
> I guess I'm a little bit confused by this thread. Why are nh-exceptions relevant to *forwarding* (as opposed to the host side of the stack, which of course needs to cache all kinds of bits per-destination)?

It is a common lookup path where the per-host routing nodes get cloned and reinserted back into the FIB.

> Or is that what you're saying - the host-based bits will live as exceptions on top of a stateless FIB?

Yes, that would be the end result of this change. Also, these entries will be added on demand, so normally there won't be a lot of exceptions.

This is a recent presentation about the IPv4 routing cache removal:
http://workshop.netfilter.org/2013/wiki/images/2/2a/DaveM_route_cache_removed_nfws2013.pdf

Greetings,
Hannes
Re: Linux IPv6 routing strange behaviour
I have some additional info about the issue I found. Even with no traffic and no full view, but with a lot of interfaces (a tunnel broker node is a good example), the static routes are being duplicated. That is definitely NOT the route cache described below, as a route cache entry points to a HOST, not to a network.

On 15.08.13 13:54, Hannes Frederic Sowa wrote:
> [...]
Re: Linux IPv6 routing strange behaviour
Just to add a data point to Max's last remark: at SixXS we moved away from kernel-based routing by implementing IPv6 routing in userspace (taking tap input and raw socket output), largely because of neighbor cache pollution and a streak of crashes when we started scaling beyond, say, 2000 interfaces.

Pim

On Aug 15, 2013 1:07 PM, Max Tulyev max...@netassist.ua wrote:
> I have some additional info about the issue I found. Even if no traffic and no full-view, but a lot of interfaces (tunnel broker node is a good sample), the static routes are duplicating. That is definitely NOT a route cache described below, as route cache is pointing to the HOST, not to the network.
Re: Linux IPv6 routing strange behaviour
On Thu, Aug 15, 2013 at 02:08:01PM +0300, Max Tulyev wrote:
> I have some additional info about the issue I found. Even if no traffic and no full-view, but a lot of interfaces (tunnel broker node is a good sample), the static routes are duplicating. That is definitely NOT a route cache described below, as route cache is pointing to the HOST, not to the network.

Can you give me your kernel version and an excerpt of /proc/net/ipv6_route where this is happening? There was a small fallout because of the rt-neighbour removal in recent kernels.

Thanks,
Hannes
Re: Linux IPv6 routing strange behaviour
If so, things are much worse than I feared... The only question is: why implement userspace routing instead of fixing the kernel code?

On 15.08.13 14:14, Pim van Pelt wrote:
> Just to add a data point to Max's last remark: at SixXS we moved away from kernel-based routing by implementing IPv6 routing in userspace (taking tap input and raw socket output), largely because of neighbor cache pollution and a streak of crashes when we started scaling beyond, say, 2000 interfaces.
RE: Linux IPv6 routing strange behaviour
Because it's faster.
http://blog.erratasec.com/2013/02/custom-stack-it-goes-to-11.html

A few more juicy Unix comments here:
http://blog.erratasec.com/2013/02/unlearning-college.html

Enjoy!
Ivan

-----Original Message-----
From: ipv6-ops-bounces+ipepelnjak=gmail@lists.cluenet.de [mailto:ipv6-ops-bounces+ipepelnjak=gmail@lists.cluenet.de] On Behalf Of Max Tulyev
Sent: Thursday, August 15, 2013 1:36 PM
To: ipv6-ops@lists.cluenet.de
Subject: Re: Linux IPv6 routing strange behaviour

If so - things are much worse than I afraid... The only question is why to implement user space routing instead of fixing the kernel code?

On 15.08.13 14:14, Pim van Pelt wrote:
> Just ad a datapoint to Max' last remark, at sixxs we moved away from kernel based routing by implementing ipv6 routing in userspace (taking tap input and raw socket output) largely because of neighbor cache pollution and a streak of crashes when we started scaling beyond say 2000 interfaces.
Re: Linux IPv6 routing strange behaviour
On Thu, Aug 15, 2013 at 02:11:32PM +0200, Hannes Frederic Sowa wrote:
> [Sorry, missed the list]
>
> On Thu, Aug 15, 2013 at 02:38:30PM +0300, Max Tulyev wrote:
>> Hi Hannes,
>>
>> The situation is the same on 2.6.36-gentoo-r8 and 3.10.6-gentoo. 3.10.6-gentoo is a bit worse: quagga/bgpd hangs at start-up in most cases.
>
> This is happening without the router forwarding packets?
>
>> cat /proc/net/ipv6_route
>> cat: /proc/net/ipv6_route: Cannot allocate memory
>
> Could you try ip -6 route list table all instead? I would be interested in the cloned network routes. It could also be because of equal-cost multipathing or IPV6_SUBTREES. But I actually need the flags on the routes, so perhaps you could drop some routes when importing the full feed in quagga?
>
> You can also monitor routing insertion/deletion with ip -6 monitor route.
Re: Linux IPv6 routing strange behaviour
On 2013-08-15 13:26, Phil Mayers wrote:
> On 15/08/13 12:14, Pim van Pelt wrote:
>> Just ad a datapoint to Max' last remark, at sixxs we moved away from kernel based routing by implementing ipv6 routing in userspace (taking tap input and raw socket output) largely because of neighbor cache
>
> Interesting. Was this custom/proprietary software or is it available somewhere?

To add to Pim's comments:

It is quite specific to the problems that SixXS PoPs have: a large number of tunnels and routes. Also note that these tunnels are dynamic and thus endpoints change all the time.

The Linux kernel (and likely any other kernel) is just not designed (and likely never will be) for what the SixXS PoPs do. We saw random 'forgetting' of _static_ routing entries, and even tunnel interfaces going missing and other weird effects without any errors/warnings whatsoever; thus what really happened is a mystery. The routing logic along with the caching/neighbor lookups etc. on top of those issues did not help at all either.

Note that the same goes for FreeBSD/NetBSD/OpenBSD/OSX from our testing (yes, we checked if OSX was smarter about it, it is not ;)

From our testing, performance characteristics are mostly the same when running sixxsd on the above platforms: it fills about 10G of tunneled traffic on a virtual interface on an i7 3.4GHz. (Simulated traffic, but as everything is a static non-locking lookup that should be quite okay ;) If we ever hit the limits of that setup, we can always think about adding some threads or so to use the other CPUs (hence why I don't mention quad-core above)...

Since deploying it we have also not had any issues with the PoPs themselves anymore, except for hardware outages or routing issues outside on the network itself. (Code can't solve those... yet ;)

sixxsd is available for use solely by SixXS PoPs, but as said, it is solving a very specific problem that one likely does not have outside the scope of this. Thus it likely won't solve any problem you are having: as always, actually defining the problem one has might lead to a solution.

Some more details are available here:
http://www.sixxs.net/faq/sixxs/?faq=sixxsd

As a bonus, this is what the routing table of deham01 looks like:

8<--
root@deham01:~# ip -6 ro show
2001:6f8:862:1::/64 dev eth0 proto kernel metric 256 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900::/48 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900::/40 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1000::/40 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1100::/40 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1200::/40 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1300::/40 via 2001:6f8:900:::1 dev sixxs metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
fe80::/64 dev eth0 proto kernel metric 256 mtu 1500 advmss 1440 hoplimit 4294967295
default via fe80::5:73ff:fea0:1 dev eth0 metric 1024 mtu 1500 advmss 1440 hoplimit 4294967295
-->8

Yes, that is 5 /40s worth of address space, and everything is piped into the sixxs interface to a single neighbor that lives on the tapped interface. We thus indeed hit the Linux routing logic a bit, but as the table is small and it is a single neighbor, nothing much dynamic happens there. ip -6 monitor route is thus nice and silent.

Greets,
Jeroen
Re: Linux IPv6 routing strange behaviour
On Thu, 15 Aug 2013, Jeroen Massar wrote:
> Yes, that is 5 /40s worth of address space and everything is piped into the sixxs interface to a single neighbor that lives on the tapped interface. We thus indeed hit the Linux routing logic a bit, but as the table is small and it is a single neighbor nothing much dynamic happens there. ip -6 monitor route is thus nice an silent.

So you're actually not seeing any flow based routing here? cat /proc/net/ipv6_route contains just those routes you see in ip -6 r show?

Because in my linux kernel 3.2 based machines I have a lot more entries in cat /proc/net/ipv6_route than I have routes.

--
Mikael Abrahamsson    email: swm...@swm.pp.se
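To make the comparison Mikael describes concrete, here is a minimal Python sketch that counts how many entries in /proc/net/ipv6_route carry the cache flag. It assumes the usual ten-column layout of that file (destination, destination length, source, source length, next hop, metric, refcount, use count, flags, device, all hex except the device) and the RTF_CACHE value from linux/ipv6_route.h. The function names are made up for illustration, and whether cloned entries show up at all depends on the kernel version, as noted in this thread.

```python
import ipaddress

RTF_CACHE = 0x01000000  # assumed flag value, taken from linux/ipv6_route.h


def parse_ipv6_route_line(line):
    """Parse one line of /proc/net/ipv6_route into a small dict."""
    fields = line.split()
    # Assumed column order: dst, dst_len, src, src_len, next hop,
    # metric, refcount, use count, flags, device name.
    dst = ipaddress.IPv6Address(int(fields[0], 16))
    return {
        "prefix": f"{dst}/{int(fields[1], 16)}",
        "flags": int(fields[8], 16),
        "dev": fields[9],
    }


def summarize(lines):
    """Return (total entries, entries flagged RTF_CACHE)."""
    routes = [parse_ipv6_route_line(l) for l in lines if l.strip()]
    cached = [r for r in routes if r["flags"] & RTF_CACHE]
    return len(routes), len(cached)


def read_proc_routes(path="/proc/net/ipv6_route"):
    """Read the kernel's table; may fail on very large tables."""
    with open(path) as fh:
        return fh.read().splitlines()
```

Calling summarize(read_proc_routes()) and comparing the first number with the line count of ip -6 route show gives a rough idea of how many entries are clones rather than configured routes.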
Re: Linux IPv6 routing strange behaviour
On 15.08.13 15:14, Hannes Frederic Sowa wrote:
> You can also monitor routing insertion/deletion with ip -6 monitor route.

Yes! I think it shows the problem more. There are a lot of these errors:

netlink receive error No buffer space available (105)

What is it?

Here is a sample:

Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2804:548::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 100 mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 5500 mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 5500 mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 6600 mtu 1500 advmss 1440 hoplimit 0
Deleted 2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2804:548::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1 mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 6600 mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 100 mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778 proto zebra metric 1024 mtu 1500 advmss 1440 hoplimit 0
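A dump like this is easier to reason about once it is tallied per prefix. The following Python sketch is only an illustration of that idea: it assumes the monitor output has been saved as text with one event per line, that deletions start with the word "Deleted", and that any other non-error line is an insertion. The function names are invented for this example.

```python
from collections import Counter


def tally_route_events(lines):
    """Return (added, deleted, errors) from saved `ip -6 monitor route` text."""
    added, deleted = Counter(), Counter()
    errors = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("netlink receive error"):
            # The monitor socket overflowed; some events were lost here.
            errors += 1
        elif line.startswith("Deleted "):
            deleted[line.split()[1]] += 1
        else:
            # Assumption: every other line is a route insertion.
            added[line.split()[0]] += 1
    return added, deleted, errors


def flapping_prefixes(added, deleted):
    """Prefixes that were both deleted and (re)inserted in the sample."""
    return sorted(set(added) & set(deleted))
```

Because every "netlink receive error" line marks lost events, the counts are a lower bound on the real churn rather than an exact figure.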
Re: Linux IPv6 routing strange behaviour
On 2013-08-15 14:41, Mikael Abrahamsson wrote:
> On Thu, 15 Aug 2013, Jeroen Massar wrote:
>> Yes, that is 5 /40s worth of address space and everything is piped into the sixxs interface to a single neighbor that lives on the tapped interface. We thus indeed hit the Linux routing logic a bit, but as the table is small and it is a single neighbor nothing much dynamic happens there. ip -6 monitor route is thus nice an silent.
>
> So you're actually not seeing any flow based routing here? cat /proc/net/ipv6_route contains just those routes you see in ip -6 r show?
>
> Because in my linux kernel 3.2 based machines I have a lot more entries in cat /proc/net/ipv6_route than I have routes.

That is correct. Though on 2.6 you won't see those there from what I recall, on 3.2 you will indeed see them.

In our case that means that the tunnels are not amongst them (and that is where the majority of endpoints for us are, hence at minimum half the table entries), while the uplink (which is a default route) will cause the packet to go through Linux's kernel and create the same entry over and over.

We could likely avoid that if we wanted to, by sending the packet ourselves to the gateway and thus skipping the kernel's routing completely. As the scaling[2] and performance are already much better (and we do not have the randomly dropping interfaces[1]), and the overhead is already minimal enough, we have not bothered doing that yet.

Greets,
Jeroen

[1] Linux kernel uses a hashtable that can collide when there are lots of tunnels; but as we know the address space layout anyway, we do not have to bother with that.
[2] I recall that the interface table used to be/is a linked list...
Re: Linux IPv6 routing strange behaviour
On Thu, Aug 15, 2013 at 04:06:11PM +0300, Max Tulyev wrote:
> On 15.08.13 15:14, Hannes Frederic Sowa wrote:
>> You can also monitor routing insertion/deletion with ip -6 monitor route.
>
> Yes! I think it shows the problem more. There are a lot of this errors:

I don't see timestamps, but I guess you have massive churn in the ip6_fib because of the deletion and immediate re-insertion of prefixes.

> netlink receive error No buffer space available (105)

Netlink sockets are sockets, too, and they can run out of receive buffer. Try ip -rc with a value higher than 1048576. The maximum value is set by /proc/sys/net/core/rmem_max (you can increase this, too).

Thanks for the dumps, I will have a closer look later today.

Greetings,
Hannes
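For anyone who wants to see the receive-buffer limit from a program rather than through the ip tool, here is a rough Python sketch. It is not how ip itself is implemented; it only shows the generic SO_RCVBUF socket option being raised and read back. The helper names are hypothetical, the RTMGRP_IPV6_ROUTE value is assumed from linux/rtnetlink.h, and the netlink helper only works on Linux.

```python
import socket

RTMGRP_IPV6_ROUTE = 0x400  # assumed multicast group bit, linux/rtnetlink.h


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Return the system-wide receive buffer ceiling, or None if unreadable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def grow_rcvbuf(sock, wanted):
    """Ask for a larger receive buffer and return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted)
    # Linux caps unprivileged requests at rmem_max and typically reports
    # about twice the value it accepted (bookkeeping overhead).
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def open_ipv6_route_listener(wanted=4 * 1024 * 1024):
    """Linux only: netlink socket subscribed to IPv6 route change events."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         socket.NETLINK_ROUTE)
    granted = grow_rcvbuf(sock, wanted)
    sock.bind((0, RTMGRP_IPV6_ROUTE))  # (pid 0 = kernel assigns, groups)
    return sock, granted
```

If the value returned by grow_rcvbuf stays well below what was requested, rmem_max is the likely limit and would need raising first.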
Re: Linux IPv6 routing strange behaviour
Hi All,

I found that the exact problem I described is *NOT* a kernel or Quagga/BIRD problem. It is a bug in the ip utility. So this time it is not affecting the routing itself.

I found that ip -6 route show outputs a lot of lines - the random number from 1 to 100, with randomly repeated blocks of routes.

On the same server, netstat -6rn consistently returns the same ~13500 routes without any repeats ;)

But the other bugs, like routing stupidity and blank output from netstat -6rn under some conditions, remain...
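One quick way to check for the kind of repetition described here is to count identical lines in the captured output. The Python sketch below is an illustration only: the function names are invented, and it treats two routes as the same only if their output lines match exactly after trimming whitespace.

```python
import subprocess
from collections import Counter


def duplicate_routes(lines):
    """Map each route line seen more than once to its occurrence count."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return {route: n for route, n in counts.items() if n > 1}


def unique_count(lines):
    """Number of distinct non-empty route lines."""
    return len({line.strip() for line in lines if line.strip()})


def ip6_route_lines():
    """Capture `ip -6 route show` output (needs iproute2 installed)."""
    result = subprocess.run(["ip", "-6", "route", "show"],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Running unique_count(ip6_route_lines()) several times and comparing it with the netstat -6rn route count would show whether only the listing is unstable or the table itself differs.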