Stuart, I also can't say I've seen this, but I am curious now. I did have a few questions for you though.
1. When you say you set nf_conntrack_max/nf_conntrack_hash to 256k, did you really set the hash size that large? Typically the hash is 1/8 of the max, meaning you'd have 8 entries per hash bucket.
2. Does /sys/module/nf_conntrack/parameters/hashsize look correct?
3. Are you seeing any messages such as "nf_conntrack: table full, dropping packet"?
4. How many entries are there in the conntrack table? ('sudo conntrack -C')
5. Have you been able to drill down any further into what's taking all the time in nf_conntrack_tuple_taken()? I can't imagine you have a single bucket with tons of entries and you're spinning looking at each, but it could be that simple.

Thanks,

-Brian

On 08/16/2014 12:12 PM, Stuart Fox wrote:
> Hey neutron dev!
>
> I'm having a serious problem with my neutron router getting spin locked in
> nf_conntrack_tuple_taken.
> Has anybody else experienced it?
> "perf top" shows nf_conntrack_tuple_taken at 75%.
> As the incoming request rate goes up, nf_conntrack_tuple_taken runs very hot
> on CPU0, causing ksoftirqd/0 to run at 100%. At that point internal pings on
> the GRE network go sky high and it's game over. Pinging from a VM to the
> subnet default gateway on the neutron router goes from 0.2ms to 11s! Pinging
> from the same VM to another VM in the same subnet stays constant at 0.2ms.
>
> That very much indicates to me that the neutron router is having serious
> problems. No other part of the system seems under pressure.
>
> IPv6 is disabled, and nf_conntrack_max/nf_conntrack_hash are set to 256k.
> We've tried the default 3.13 and the utopic 3.16 kernel (3.16 has lots of work
> on removing spinlocks around nf_conntrack). 3.16 survives a little longer but
> still gets into the same state.
>
> Neutron router
> 1 x Ubuntu 14.04/Icehouse 2014.1.1 on an IBM x3550 with 4 10G Intel NICs.
> eth0 - Mgt
> eth1 - GRE
> eth2 - Public
> eth3 - unused
>
> Compute/controller nodes
> 43 x Ubuntu 14.04/Icehouse 2014.1.1 IBM x240 Flex blades with 4 Emulex NICs
> eth0 - Mgt
> eth2 - GRE
>
> Any help very much appreciated!
> Replacing the L2/L3 functions with hardware is very much an option if that's
> a better solution.
> I'm running out of time before my client decides to stay on AWS.
>
> BR,
> Stuart
>
> _______________________________________________
> OpenStack-dev mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
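For questions 1, 2 and 4 above, the relevant numbers can be pulled together in one pass. What follows is a rough Python sketch, not an existing tool: the class and function names are hypothetical, made up for illustration, and it assumes a Linux host with the nf_conntrack module loaded. It reads /proc/sys/net/netfilter/nf_conntrack_count (which should match what 'conntrack -C' reports), /proc/sys/net/netfilter/nf_conntrack_max and /sys/module/nf_conntrack/parameters/hashsize, then prints the average number of entries per hash bucket. On some kernels the hashsize parameter is only readable by root.

```python
"""Rough sketch: report nf_conntrack table sizing on a Linux host.

The names here (ConntrackStats, load_stats, ...) are hypothetical and
only for illustration; they are not part of any existing tool.
"""
from dataclasses import dataclass
from pathlib import Path

# Paths relative to the filesystem root, so tests can point at a fake root.
COUNT_FILE = "proc/sys/net/netfilter/nf_conntrack_count"
MAX_FILE = "proc/sys/net/netfilter/nf_conntrack_max"
HASHSIZE_FILE = "sys/module/nf_conntrack/parameters/hashsize"


@dataclass
class ConntrackStats:
    count: int      # entries currently tracked
    maximum: int    # nf_conntrack_max
    hashsize: int   # number of hash buckets

    def avg_chain_now(self) -> float:
        """Mean entries per bucket right now, assuming an even spread."""
        return self.count / self.hashsize

    def avg_chain_at_max(self) -> float:
        """Mean entries per bucket once the table reaches nf_conntrack_max."""
        return self.maximum / self.hashsize


def read_int(root: Path, rel: str) -> int:
    """Read a single integer from a proc/sys file under root."""
    return int((root / rel).read_text().strip())


def load_stats(root: Path = Path("/")) -> ConntrackStats:
    """Collect count, max and bucket count; root is overridable for testing."""
    return ConntrackStats(
        count=read_int(root, COUNT_FILE),
        maximum=read_int(root, MAX_FILE),
        hashsize=read_int(root, HASHSIZE_FILE),
    )


def main() -> None:
    try:
        stats = load_stats()
    except (OSError, ValueError) as exc:
        # Module not loaded, not Linux, or hashsize not readable without root.
        print(f"could not read conntrack settings: {exc}")
        return
    print(f"entries={stats.count} max={stats.maximum} buckets={stats.hashsize}")
    print(f"avg entries/bucket now: {stats.avg_chain_now():.2f}")
    print(f"avg entries/bucket at max: {stats.avg_chain_at_max():.2f}")


if __name__ == "__main__":
    main()
```

As a worked example, with max at 262144 and the usual 1/8 hash (32768 buckets), a full table averages 8.0 entries per bucket; with the hash also at 262144 it averages 1.0. These are averages that assume an even spread, so they won't reveal the "single bucket with tons of entries" case from question 5. Something like perf annotate on nf_conntrack_tuple_taken might help more there.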
