Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Hannes Frederic Sowa
On Thu, Aug 15, 2013 at 11:35:50AM +0100, Phil Mayers wrote:
 On 15/08/13 11:31, Hannes Frederic Sowa wrote:
 On Thu, Aug 15, 2013 at 07:39:23AM +0200, Mikael Abrahamsson wrote:
 On Wed, 14 Aug 2013, Max Tulyev wrote:
 
 What is the solution? There are *MILLIONS* of flows in the backbone...
 
 The solution is not to use a flow-routing platform in the core. This
 lesson was learnt at the end of the '90s.
 
 So until the Linux IPv6 forwarding code is fixed to do stateless
 forwarding, it's just not suited for your application.
 
 Some time ago I started working on nh-exceptions, but it is a very
 delicate change. I hope I can look at this again as soon as I have some
 more free time. Because the data structures are already in place for
 IPv4 in the generic routing code, it should not be such a big patch.
 
 I guess I'm a little bit confused by this thread.
 
 Why are nh-exceptions relevant to *forwarding* (as opposed to the host 
 side of the stack, which of course needs to cache all kinds of bits 
 per-destination)?

It is a common lookup path in which the per-host routing nodes get cloned and
reinserted into the FIB.
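A toy model may make the cloning behaviour concrete. This is a hypothetical sketch, not the kernel's ip6_fib code: the CloningFib class and its linear longest-prefix scan are illustrative assumptions. Each lookup that resolves through a network route inserts a /128 clone into the same table, so the table grows with the number of distinct destinations (flows) rather than with the number of configured prefixes.

```python
import ipaddress


class CloningFib:
    """Toy FIB that clones a /128 host entry on every lookup (hypothetical model)."""

    def __init__(self):
        # Configured routes and per-host clones live in one table.
        self.entries = {}

    def add_route(self, prefix, nexthop):
        self.entries[ipaddress.IPv6Network(prefix)] = nexthop

    def lookup(self, dst):
        addr = ipaddress.IPv6Address(dst)
        # Longest-prefix match by linear scan; good enough for a toy model.
        matches = [net for net in self.entries if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        nexthop = self.entries[best]
        # The clone: a per-host entry reinserted into the same table.
        self.entries[ipaddress.IPv6Network(f"{addr}/128")] = nexthop
        return nexthop


fib = CloningFib()
fib.add_route("::/0", "fe80::1")
for i in range(1000):
    fib.lookup(f"2001:db8::{i:x}")
print(len(fib.entries))  # 1001: one configured route plus 1000 clones
```

With millions of destinations crossing a backbone router, the same loop would leave millions of entries behind.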

 Or is that what you're saying - the host-based bits will live as 
 exceptions on top of a stateless FIB?

Yes, that would be the end result of this change. Also these entries will be
added on demand, so normally there won't be a lot of exceptions.
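For contrast, a minimal sketch of the "stateless FIB plus exceptions" idea, under stated assumptions: StatelessFib is a hypothetical class, and using path MTU as the per-destination data is an illustrative choice (the IPv4 design keeps learned data such as PMTU in next-hop exceptions). Lookups are read-only; an exception entry is created only when an event such as an ICMPv6 Packet Too Big arrives.

```python
import ipaddress


class StatelessFib:
    """Toy stateless FIB with per-destination exceptions created on demand.

    Hypothetical model: lookups never insert anything; only an event such as
    an ICMPv6 Packet Too Big adds an exception entry for one destination.
    """

    def __init__(self):
        self.routes = {}      # IPv6Network -> next hop (configured state only)
        self.exceptions = {}  # IPv6Address -> learned data, e.g. {"mtu": 1280}

    def add_route(self, prefix, nexthop):
        self.routes[ipaddress.IPv6Network(prefix)] = nexthop

    def lookup(self, dst):
        addr = ipaddress.IPv6Address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None, None
        best = max(matches, key=lambda net: net.prefixlen)
        # Read-only: forwarding lookups leave the FIB untouched.
        return self.routes[best], self.exceptions.get(addr)

    def learn_pmtu(self, dst, mtu):
        # On-demand exception, e.g. after an ICMPv6 Packet Too Big.
        self.exceptions[ipaddress.IPv6Address(dst)] = {"mtu": mtu}


fib = StatelessFib()
fib.add_route("::/0", "fe80::1")
fib.add_route("2001:db8::/32", "fe80::2")
for i in range(1000):
    fib.lookup(f"2001:db8::{i:x}")
fib.learn_pmtu("2001:db8::42", 1280)
print(len(fib.routes), len(fib.exceptions))  # 2 1
print(fib.lookup("2001:db8::42"))            # ('fe80::2', {'mtu': 1280})
```

After 1000 lookups the table still holds only the two configured routes; state grows only with the number of exception events.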

This is a recent presentation about the IPv4 routing cache removal:
http://workshop.netfilter.org/2013/wiki/images/2/2a/DaveM_route_cache_removed_nfws2013.pdf

Greetings,

  Hannes



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Max Tulyev
I have some additional info about the issue I found.

Even with no traffic and no full view, but with a lot of interfaces (a tunnel
broker node is a good example), the static routes are being duplicated.

That is definitely NOT the route cache described below, as a route cache entry
points to a HOST, not to a network.


On 15.08.13 13:54, Hannes Frederic Sowa wrote:
 [...]



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Pim van Pelt
Just to add a data point to Max's last remark: at SixXS we moved away from kernel-based
routing by implementing IPv6 routing in userspace (taking tap input
and raw socket output), largely because of neighbor cache pollution and a
streak of crashes when we started scaling beyond, say, 2000 interfaces.

Pim
On Aug 15, 2013 1:07 PM, Max Tulyev max...@netassist.ua wrote:

 [...]




Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Hannes Frederic Sowa
On Thu, Aug 15, 2013 at 02:08:01PM +0300, Max Tulyev wrote:
 I have some additional info about the issue I found.
 
 Even with no traffic and no full view, but with a lot of interfaces (a tunnel
 broker node is a good example), the static routes are being duplicated.
 
 That is definitely NOT the route cache described below, as a route cache entry
 points to a HOST, not to a network.

Can you give me your kernel version and an excerpt of
/proc/net/ipv6_route where this is happening?

There was a small fallout from the rt-neighbour removal in recent
kernels.
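For reading such an excerpt, a small decoder may help. This sketch assumes the usual /proc/net/ipv6_route layout (destination, prefix length, source, source prefix length, next hop, metric, refcount, use count, flags, device; every number in hex) and the RTF_* values from the kernel headers linux/route.h and linux/ipv6_route.h. Entries with RTF_CACHE set are cloned entries rather than configured routes. The two SAMPLE lines are made up, not taken from any router in this thread.

```python
import ipaddress
from collections import Counter

# Flag values from the kernel headers linux/route.h and linux/ipv6_route.h.
RTF_UP = 0x0001
RTF_GATEWAY = 0x0002
RTF_CACHE = 0x01000000


def parse_ipv6_route_line(line):
    """Decode one line of /proc/net/ipv6_route; every numeric field is hex."""
    (dst, dst_len, _src, _src_len, nexthop,
     metric, refcnt, use, flags, dev) = line.split()
    return {
        "dst": ipaddress.IPv6Network((int(dst, 16), int(dst_len, 16)), strict=False),
        "nexthop": ipaddress.IPv6Address(int(nexthop, 16)),
        "metric": int(metric, 16),
        "refcnt": int(refcnt, 16),
        "use": int(use, 16),
        "flags": int(flags, 16),
        "dev": dev,
    }


def summarize(lines):
    """Count configured routes vs. cloned (RTF_CACHE) entries."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        entry = parse_ipv6_route_line(line)
        counts["cache" if entry["flags"] & RTF_CACHE else "route"] += 1
    return counts


# Made-up sample lines.
SAMPLE = [
    # 2001:db8::/32 via fe80::1, metric 1024, flags RTF_UP|RTF_GATEWAY
    "20010db8000000000000000000000000 20 00000000000000000000000000000000 00 "
    "fe800000000000000000000000000001 00000400 00000001 00000000 00000003     eth0",
    # a cloned /128 inside that prefix, with RTF_CACHE set
    "20010db8000000000000000000000001 80 00000000000000000000000000000000 00 "
    "fe800000000000000000000000000001 00000000 00000001 00000005 01000003     eth0",
]
print(dict(summarize(SAMPLE)))  # {'route': 1, 'cache': 1}
```

On a live Linux system, an open file object for /proc/net/ipv6_route can be passed to summarize() in place of SAMPLE.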

Thanks,

  Hannes


Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Max Tulyev
If so, things are much worse than I feared...

The only question is: why implement userspace routing instead of
fixing the kernel code?

On 15.08.13 14:14, Pim van Pelt wrote:
 Just to add a data point to Max's last remark: at SixXS we moved away from
 kernel-based routing by implementing IPv6 routing in userspace (taking
 tap input and raw socket output), largely because of neighbor cache
 pollution and a streak of crashes when we started scaling beyond, say,
 2000 interfaces.



RE: Linux IPv6 routing strange behaviour

2013-08-15 Thread Ivan Pepelnjak
Because it's faster.

http://blog.erratasec.com/2013/02/custom-stack-it-goes-to-11.html

A few more juicy Unix comments here:

http://blog.erratasec.com/2013/02/unlearning-college.html

Enjoy!
Ivan

-----Original Message-----
From: ipv6-ops-bounces+ipepelnjak=gmail@lists.cluenet.de
[mailto:ipv6-ops-bounces+ipepelnjak=gmail@lists.cluenet.de] On Behalf Of
Max Tulyev
Sent: Thursday, August 15, 2013 1:36 PM
To: ipv6-ops@lists.cluenet.de
Subject: Re: Linux IPv6 routing strange behaviour

[...]



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Hannes Frederic Sowa
On Thu, Aug 15, 2013 at 02:11:32PM +0200, Hannes Frederic Sowa wrote:
 [Sorry, missed the list]
 
 On Thu, Aug 15, 2013 at 02:38:30PM +0300, Max Tulyev wrote:
  Hi Hannes,
  
  The situation is same on 2.6.36-gentoo-r8 and 3.10.6-gentoo.
  3.10.6-gentoo is a bit worse: quagga/bgpd hangs at start-up in most cases.
 
 This is happening without the router forwarding packets?
 
  
  cat /proc/net/ipv6_route
  cat: /proc/net/ipv6_route: Cannot allocate memory
 
 Could you try ip -6 route list table all instead? I would be interested in the
 cloned network routes. It could also be because of equal-cost multipath or
 IPV6_SUBTREES.
 
 But I actually need the flags on the routes, so perhaps you could drop some
 routes when importing the full feed into quagga?

You can also monitor route insertion/deletion with ip -6 monitor route.



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Jeroen Massar
On 2013-08-15 13:26, Phil Mayers wrote:
 On 15/08/13 12:14, Pim van Pelt wrote:
 Just to add a data point to Max's last remark: at SixXS we moved away from
 kernel-based routing by implementing IPv6 routing in userspace (taking
 tap input and raw socket output), largely because of neighbor cache
 
 Interesting. Was this custom/proprietary software or is it available
 somewhere?

To add to Pim's comments:

It is quite specific to the problems that SixXS PoPs have:
 A large number of tunnels and routes

Also note that these tunnels are dynamic and thus endpoints change all the time.

The Linux kernel (and likely any other kernel) is just not designed (and likely 
never will be) for what the SixXS PoPs do. We saw random 'forgetting' of 
_static_ routing entries, and even tunnel interfaces going missing, along with 
other weird effects, all without any errors or warnings whatsoever; what really 
happened thus remains a mystery.

The routing logic, along with the caching/neighbor lookups etc., did not help at 
all on top of those issues either. Note that, based on our testing, the same goes 
for FreeBSD/NetBSD/OpenBSD/OSX (yes, we checked whether OSX was smarter about it; 
it is not ;)

From our testing, performance characteristics are mostly the same when running 
sixxsd on the above platforms: it fills about 10G of tunneled traffic on a 
virtual interface on an i7 at 3.4 GHz. (Simulated traffic, but as everything is a 
static, non-locking lookup, that should be quite okay ;) If we ever hit the 
limits of that setup, we can always think about adding some threads or so to 
use the other CPUs (hence why I don't mention quad-core above)...

Since deploying it, we have not had any issues with the PoPs themselves 
anymore, except for hardware outages or routing issues out on the network 
itself. (Code can't solve those... yet ;)

sixxsd is available for use solely by SixXS PoPs, but as said, it solves a 
very specific problem that one likely does not have outside this scope. 
Thus it likely won't solve any problem you are having: as always, actually 
defining the problem one has might lead to a solution.

Some more details are available here:
 http://www.sixxs.net/faq/sixxs/?faq=sixxsd

As a bonus, this is what the routing table of deham01 looks like:
8<--
root@deham01:~# ip -6 ro show
2001:6f8:862:1::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900::/48 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:900::/40 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1000::/40 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1100::/40 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1200::/40 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
2001:6f8:1300::/40 via 2001:6f8:900:::1 dev sixxs  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
default via fe80::5:73ff:fea0:1 dev eth0  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295
-->8

Yes, that is 5 /40s worth of address space, and everything is piped into the 
sixxs interface to a single neighbor that lives on the tapped interface. We 
thus do hit the Linux routing logic a bit, but as the table is small and there 
is a single neighbor, nothing much dynamic happens there. ip -6 monitor route 
is thus nice and silent.
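The "5 /40s" figure can be checked with Python's standard ipaddress module, using the prefixes from the deham01 table. collapse_addresses() drops the /48 (it lies inside 2001:6f8:900::/40) and merges the four adjacent /40s starting at 2001:6f8:1000:: into a single /38.

```python
import ipaddress

# Prefixes routed into the sixxs interface on deham01, per the table above.
prefixes = [
    "2001:6f8:900::/48",
    "2001:6f8:900::/40",
    "2001:6f8:1000::/40",
    "2001:6f8:1100::/40",
    "2001:6f8:1200::/40",
    "2001:6f8:1300::/40",
]
nets = [ipaddress.IPv6Network(p) for p in prefixes]

# Drop covered prefixes and merge adjacent ones into the smallest equivalent set.
collapsed = list(ipaddress.collapse_addresses(nets))
print(collapsed)
# [IPv6Network('2001:6f8:900::/40'), IPv6Network('2001:6f8:1000::/38')]

# Express the total in /40 units: a /40 holds 2**(128 - 40) = 2**88 addresses.
total = sum(n.num_addresses for n in collapsed)
print(total // 2**88)  # 5
```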

Greets,
 Jeroen



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Mikael Abrahamsson

On Thu, 15 Aug 2013, Jeroen Massar wrote:

Yes, that is 5 /40s worth of address space and everything is piped into 
the sixxs interface to a single neighbor that lives on the tapped 
interface. We thus indeed hit the Linux routing logic a bit, but as the 
table is small and it is a single neighbor nothing much dynamic happens 
there. ip -6 monitor route is thus nice and silent.


So you're actually not seeing any flow based routing here?

Does cat /proc/net/ipv6_route contain just those routes you see in ip -6 r 
show?


Because on my Linux kernel 3.2 based machines I have a lot more entries in 
cat /proc/net/ipv6_route than I have routes.


--
Mikael Abrahamsson    email: swm...@swm.pp.se


Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Max Tulyev
On 15.08.13 15:14, Hannes Frederic Sowa wrote:
 
 You can also monitor routing insertion/deletion with ip -6 monitor route.
 

Yes! I think it shows the problem more clearly. There are a lot of these errors:

netlink receive error No buffer space available (105)

What is it?

Here is a sample:

Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2804:548::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 100  mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 5500  mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 5500  mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 6600  mtu 1500 advmss 1440 hoplimit 0
Deleted 2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2a01:bec0::/32 via fe80::21b:21ff:febf:96b4 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2804:548::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2607:f088::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1  mtu 1500 advmss 1440 hoplimit 0
Deleted 2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 6600  mtu 1500 advmss 1440 hoplimit 0
2001:67c:2884::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 100  mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
Deleted 2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2a00:1a58::/32 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
netlink receive error No buffer space available (105)
Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0
2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024  mtu 1500 advmss 1440 hoplimit 0



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Jeroen Massar
On 2013-08-15 14:41, Mikael Abrahamsson wrote:
 On Thu, 15 Aug 2013, Jeroen Massar wrote:
 
 Yes, that is 5 /40s worth of address space and everything is piped
 into the sixxs interface to a single neighbor that lives on the tapped
 interface. We thus indeed hit the Linux routing logic a bit, but as
 the table is small and it is a single neighbor nothing much dynamic
 happens there. ip -6 monitor route is thus nice and silent.
 
 So you're actually not seeing any flow based routing here?
 
 cat /proc/net/ipv6_route contains just those routes you see in ip -6
 r show?
 
 Because in my linux kernel 3.2 based machines I have a lot more entries
 in cat /proc/net/ipv6_route than I have routes.

That is correct. From what I recall, you won't see those entries there on
2.6, but on 3.2 you will indeed see them.

In our case that means that the tunnels are not amongst them (and that
is where the majority of our endpoints are, hence at minimum half the
table entries), while the uplink (which is a default route) causes
the packet to go through the Linux kernel and create the same entry over
and over.

We could likely avoid that if we wanted to, by sending the packet
to the gateway ourselves and thus skipping the kernel's routing completely.
As the scaling[2] and performance are already much better (and we do not
have the randomly dropping interfaces[1]), and the overhead is already
minimal enough, we have not bothered doing that yet.

Greets,
 Jeroen

[1] Linux kernel uses a hashtable that can collide when there are lots
of tunnels; but as we know the address space layout anyway, we do not
have to bother with that.
[2] I recall that the interface table used to be (or still is) a linked list...
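Footnote [1] says a known address layout makes hash collisions a non-issue. One way that can work is direct indexing, sketched here under a purely hypothetical layout: every tunnel's /64 is carved out of one /48 (TUNNEL_BLOCK, a documentation prefix), so the 16 bits between /48 and /64 are the tunnel number and serve directly as an array index. TunnelTable and the endpoint values are illustrative; this is not a description of sixxsd's actual data structures.

```python
import ipaddress

# Hypothetical layout: each tunnel gets one /64 out of a single /48, so the
# 16 bits between /48 and /64 are the tunnel number.
TUNNEL_BLOCK = ipaddress.IPv6Network("2001:db8:100::/48")


class TunnelTable:
    """Direct-indexed tunnel table: no hashing, hence no collisions (a sketch)."""

    def __init__(self):
        self.slots = [None] * 2**16  # one slot per possible /64 in the block

    @staticmethod
    def index_of(addr):
        addr = ipaddress.IPv6Address(addr)
        if addr not in TUNNEL_BLOCK:
            return None
        # Shift off the 64 interface-ID bits and keep the 16 subnet bits.
        return (int(addr) >> 64) & 0xFFFF

    def add(self, prefix64, endpoint):
        net = ipaddress.IPv6Network(prefix64)
        idx = self.index_of(net.network_address)
        if net.prefixlen != 64 or idx is None:
            raise ValueError(f"{prefix64} is not a /64 inside {TUNNEL_BLOCK}")
        self.slots[idx] = endpoint

    def lookup(self, dst):
        idx = self.index_of(dst)
        return None if idx is None else self.slots[idx]


table = TunnelTable()
table.add("2001:db8:100:2a::/64", "192.0.2.10")
print(table.lookup("2001:db8:100:2a::1"))  # 192.0.2.10
```

The trade-off is memory for a fully populated slot array in exchange for a constant-time lookup that never degrades as tunnels are added.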



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Hannes Frederic Sowa
On Thu, Aug 15, 2013 at 04:06:11PM +0300, Max Tulyev wrote:
 On 15.08.13 15:14, Hannes Frederic Sowa wrote:
  
  You can also monitor routing insertion/deletion with ip -6 monitor route.
  
 
 Yes! I think it shows the problem more. There are a lot of this errors:

I don't see timestamps, but I guess you have massive churn in the ip6_fib
because of the deletion and immediate re-insertion of prefixes.
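To put numbers on that churn, a sketch like the following could tally the monitor output. It assumes one event per line (as ip prints them, before any mail wrapping), treats lines starting with "Deleted " as deletions, counts "netlink receive error" lines as markers of lost events, and treats every other non-empty line as an added route whose first field is the prefix. The sample lines are shortened copies of entries from Max's dump; churn_report and flapping_prefixes are hypothetical helper names.

```python
from collections import Counter


def churn_report(lines):
    """Tally route events from unwrapped `ip -6 monitor route` output.

    Lines starting with "Deleted " are deletions, "netlink receive error"
    lines mark lost events, and any other non-empty line is treated as an
    added route whose first field is the prefix.
    """
    adds, deletes = Counter(), Counter()
    overruns = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("netlink receive error"):
            overruns += 1  # the socket buffer overflowed; events were dropped
        elif line.startswith("Deleted "):
            deletes[line.split()[1]] += 1
        else:
            adds[line.split()[0]] += 1
    return adds, deletes, overruns


def flapping_prefixes(adds, deletes):
    """Prefixes that were both withdrawn and (re)announced in the window."""
    return sorted(set(adds) & set(deletes))


SAMPLE = [
    "Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024",
    "2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024",
    "netlink receive error No buffer space available (105)",
    "Deleted 2001:7fb:ff02::/48 via 2a01:d0:0:1c::f9 dev eth1.778  proto zebra  metric 1024",
    "Deleted 2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024",
    "2001:67c:1ec::/48 via fe80::92e2:baff:fe16:7e9 dev eth1.778  proto zebra  metric 1024",
]
adds, deletes, overruns = churn_report(SAMPLE)
print(flapping_prefixes(adds, deletes), overruns)
# ['2001:67c:1ec::/48', '2001:7fb:ff02::/48'] 1
```

Because overruns mean events were lost, the add/delete counts from a capture with errors are lower bounds.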

 netlink receive error No buffer space available (105)

Netlink sockets are sockets, too, and they can run out of receive
buffer space. Try ip -rc with a value higher than 1048576. The maximum value is
/proc/sys/net/core/rmem_max (you can increase this, too).
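The buffer arithmetic can be sketched as follows, based on the socket(7) description of SO_RCVBUF on Linux: an unprivileged setsockopt() request is capped at net.core.rmem_max, and the kernel then doubles it for bookkeeping overhead (a process with CAP_NET_ADMIN can use SO_RCVBUFFORCE to exceed the cap). effective_rcvbuf() is a simplified model that ignores the small minimum clamp; try_netlink_rcvbuf() asks a real rtnetlink socket, Linux only. Which of the two options a given ip build uses is not checked here.

```python
import socket


def effective_rcvbuf(requested, rmem_max):
    """Simplified model of Linux SO_RCVBUF handling for an unprivileged caller.

    The request is capped at net.core.rmem_max, then doubled by the kernel to
    leave room for bookkeeping overhead. The small minimum clamp is ignored.
    """
    return 2 * min(requested, rmem_max)


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Read the current cap (Linux only)."""
    with open(path) as f:
        return int(f.read())


def try_netlink_rcvbuf(requested):
    """Ask an rtnetlink socket for `requested` bytes; return what was granted.

    Returns None on platforms without AF_NETLINK.
    """
    if not hasattr(socket, "AF_NETLINK"):
        return None
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# If rmem_max were 212992, asking for 4 MiB would yield only 2 * 212992 bytes.
print(effective_rcvbuf(4 * 1048576, 212992))  # 425984
```

For an unprivileged process on Linux, try_netlink_rcvbuf(4 * 1048576) should agree with effective_rcvbuf(4 * 1048576, read_rmem_max()); raising net.core.rmem_max lifts the cap.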

Thanks for the dumps, I will have a closer look later today.

Greetings,

  Hannes



Re: Linux IPv6 routing strange behaviour

2013-08-15 Thread Max Tulyev
Hi All,

I found that the exact problem I described is *NOT* a kernel or Quagga/BIRD
problem. It is a bug in the ip utility, so this time it is not
affecting the routing itself.

I found that ip -6 route show outputs a lot of lines (a random number,
from 1 to 100 times too many), with randomly repeated blocks of routes.
On the same server, netstat -6rn consistently returns the same
~13500 routes without any repetition ;)
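One way to make that comparison reproducible is to count repeated lines in the ip output. This sketch normalizes whitespace and keys on the whole line, so two different routes for the same prefix (say, with different metrics) are not mistaken for duplicates. find_repeated_routes is a hypothetical helper and the input lines are made-up examples; live output could be fed in via subprocess.run(["ip", "-6", "route", "show"], capture_output=True, text=True).stdout.splitlines().

```python
from collections import Counter


def find_repeated_routes(lines):
    """Find route lines that occur more than once in `ip -6 route show` output.

    Whitespace is normalized and the whole line is the key, so two different
    routes for the same prefix (e.g. different metrics) are not duplicates.
    Returns (total lines, distinct lines, {line: count} for repeated lines).
    """
    counts = Counter(" ".join(line.split()) for line in lines if line.strip())
    repeated = {route: n for route, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), repeated


# Made-up example input.
output = [
    "2001:db8::/32 via fe80::1 dev eth0  proto zebra  metric 1024",
    "2001:db8:1::/48 via fe80::1 dev eth0  proto zebra  metric 1024",
    "2001:db8::/32 via fe80::1 dev eth0  proto zebra  metric 1024",
    "2001:db8::/32 via fe80::1 dev eth0  proto zebra  metric 20",
]
total, distinct, repeated = find_repeated_routes(output)
print(total, distinct)  # 4 3
print(repeated)  # {'2001:db8::/32 via fe80::1 dev eth0 proto zebra metric 1024': 2}
```

If the distinct count matches what netstat -6rn reports while the total keeps changing between runs, the repetition is in the tool's output rather than in the FIB.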

But the other bugs, like the routing stupidity and blank output from
netstat -6rn under some conditions, remain...