On Jan 21, 2014, at 12:55 PM, Jesse Gross <[email protected]> wrote:

> On Wed, Jan 8, 2014 at 4:15 PM, Jarno Rajahalme <[email protected]> wrote:
>> 'perf' report on current master shows that kernel-side locking is
>> taking more than 20% of the overall OVS execution time under TCP_CRR
>> tests with flow tables that make all new connections go to userspace
>> and that create a huge number of kernel flows (~60000) on a 32-CPU
>> server.  It turns out that disabling per-CPU flow stats largely
>> removes this overhead and significantly improves performance on *this
>> kind of test*.
>> 
>> To address this problem, this series:
>> - reduces overhead by not locking all-zero stats
>> - keeps flow stats on NUMA nodes, essentially avoiding locking between
>>   physical CPUs.  Stats readers still need to read/lock across the CPUs,
>>   but now once instead of 16 times in the test case described above.
>> 
>> In order to avoid performance regressions elsewhere, this series also
>> introduces prefetching to avoid stalling due to stats being out
>> of L1 cache for both readers and writers.
>> 
>> With all of these applied, the OVS kernel-side locking overhead no longer
>> shows up near the top of the 'perf' reports.
>> 
>> The effectiveness of these strategies under different load scenarios
>> requires more testing.
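
To make the "not locking all-zero stats" point concrete, here is a rough
sketch of the reader side. The struct and function names below are made up
for illustration and are not the actual datapath names:

/* Illustrative only: not the actual OVS datapath structures. */
#include <linux/spinlock.h>
#include <linux/types.h>

struct example_flow_stats {
        spinlock_t lock;
        u64 packet_count;
        u64 byte_count;
        unsigned long used;     /* last use in jiffies; 0 == never used */
};

/* Aggregate stats over a flow's per-node entries, skipping untouched ones. */
static void example_flow_stats_get(struct example_flow_stats *stats,
                                   int nr_entries, u64 *packets, u64 *bytes)
{
        int i;

        *packets = 0;
        *bytes = 0;

        for (i = 0; i < nr_entries; i++) {
                struct example_flow_stats *s = &stats[i];

                /*
                 * Unlocked check: an entry that has never been written
                 * stays all-zero, so there is nothing to add and no
                 * reason to bounce its lock's cache line around.
                 */
                if (!s->used)
                        continue;

                spin_lock(&s->lock);
                *packets += s->packet_count;
                *bytes += s->byte_count;
                spin_unlock(&s->lock);
        }
}
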
> 
> Some high level comments:
> * What was the pattern of flows being installed in the kernel?
> There's an existing heuristic that should turn off per-CPU stats if
> the 5-tuple is an exact match. I'm surprised that TCP_CRR both caused
> a large number of flows to be installed and did not trigger this
> (obviously this doesn't help if there are large number of IP addresses
> but it seems like it should be OK for ports).

The test case has a rule that looks at the TCP port numbers, but not at the
IP addresses, so the kernel flows are not “exact 5-tuple” flows. I’d expect
this to become the norm whenever rules wildcard any IP address bits (with
prefix trie lookups) or any port number bits.
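
For reference, the kind of check that heuristic performs could look roughly
like the sketch below; the struct and function names are made up for
illustration, not the actual ones in the flow netlink code (where the mask
fields are also in network byte order):

#include <linux/types.h>

struct example_5tuple_mask {
        u32 ipv4_src, ipv4_dst; /* which IP address bits are matched */
        u16 tp_src, tp_dst;     /* which port bits are matched */
        u8  ip_proto;
};

/* Per-flow (non-per-CPU) stats only make sense when every bit of the
 * 5-tuple is significant, i.e. one kernel flow maps to one connection. */
static bool example_mask_is_exact_5tuple(const struct example_5tuple_mask *m)
{
        return m->ipv4_src == 0xffffffff && m->ipv4_dst == 0xffffffff &&
               m->tp_src == 0xffff && m->tp_dst == 0xffff &&
               m->ip_proto == 0xff;
}

In my test case the address masks are zero while the port masks are all
ones, so a check like this returns false and the flows keep per-CPU stats.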

> * The per-NUMA patch is very similar to Pravin's original per-CPU
> patch, which received some objections from upstream. Pravin was
> originally concerned about the cost of using the per-CPU allocator so
> it would be interesting to see how just that change compares to your
> results. I also suspect that you will get similar comments from
> upstream as with his patch, so it would be good to think about how to
> address those.

I’ve developed the concept a bit further (see the V2 patch I just sent out),
now allocating the stats from memory pools local to specific NUMA nodes. I
think this is well justified, as I’m not trying to reimplement per-CPU
allocation without the per-CPU allocator.
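
As a rough sketch of what I mean by node-local pools (names made up for
illustration; the actual V2 layout may differ), the stats block for a given
node can be backed by node-local memory through the regular slab API:

/* Illustrative only; same example struct as in the earlier sketch. */
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct example_flow_stats {
        spinlock_t lock;
        u64 packet_count;
        u64 byte_count;
        unsigned long used;
};

static struct kmem_cache *example_stats_cache;

/* Allocate one stats block from memory local to 'node'. */
static struct example_flow_stats *example_stats_alloc_node(int node)
{
        struct example_flow_stats *s;

        s = kmem_cache_alloc_node(example_stats_cache, GFP_KERNEL, node);
        if (!s)
                return NULL;

        s->packet_count = 0;
        s->byte_count = 0;
        s->used = 0;
        spin_lock_init(&s->lock);
        return s;
}

The cache itself would be created once with kmem_cache_create() using
SLAB_HWCACHE_ALIGN, so stats blocks do not share cache lines, and writers
would pick their entry with numa_node_id().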

Please let me know what you think of the patch V2.

> * Do you have data to support the prefetch changes that you can
> include in the commit message?

Sorry, I previously missed that you had asked for the prefetch numbers. Yes,
doing the prefetch lowers the relevant spin lock overhead reported by ‘perf’;
I’ll run some fresh tests and report back.
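
To illustrate the idea (again with made-up names, not the actual patch
code): the write-intent prefetch is issued right after flow lookup, so by
the time the stats update takes the spinlock the cache line is already in
L1, which shrinks both the lock hold time and the time other CPUs spend
spinning on it:

#include <linux/jiffies.h>
#include <linux/prefetch.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct example_flow_stats {
        spinlock_t lock;
        u64 packet_count;
        u64 byte_count;
        unsigned long used;
};

/* Called as early as possible after flow lookup. */
static inline void example_stats_prefetch(struct example_flow_stats *s)
{
        prefetchw(s);
}

/* Called later on the same packet's execution path. */
static void example_stats_update(struct example_flow_stats *s,
                                 unsigned int len)
{
        spin_lock(&s->lock);
        s->packet_count++;
        s->byte_count += len;
        s->used = jiffies;
        spin_unlock(&s->lock);
}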

  Jarno
