On 29 Aug 2018, at 20:12, Jakub Kicinski wrote:

On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:

On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
On 11 Aug 2018, at 21:06, David Miller wrote:

From: Jakub Kicinski <jakub.kicin...@netronome.com>
Date: Thu, 9 Aug 2018 20:26:08 -0700

It is not immediately clear why this is needed.  The memory and
updating two sets of counters won't come for free, so perhaps a
stronger justification than troubleshooting is due? :S

Netdev has counters for fallback vs forwarded traffic, so you'd know
that traffic hits the SW datapath, plus the rules which are in_hw will
most likely not match as of today for flower (assuming correctness).

I strongly believe that these counters are a requirement for a mixed
software/hardware (flow) based forwarding environment. The global
counters will not help much here as you might have chosen to have
certain traffic forwarded by software.

These counters are probably the only option you have to figure out why
forwarding is not as fast as expected, and you want to blame the TC
offload NIC.

The suggested debugging flow would be:
 (1) check that the global fallback counters are incrementing;
 (2) find a flow with high stats but no in_hw flag set (see the sketch
     below).

The in_hw indication should be sufficient in most cases (unless there
are shared blocks between netdevs of different ASICs...).
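
As a rough illustration of step (2), the check boils down to something
like the sketch below. The flow_entry layout, the helper name and the
sample data are made-up placeholders, not the real tc/netlink
structures; only the filtering logic is the point.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct flow_entry {
    const char *name;   /* some identifier for the flow/filter     */
    uint64_t packets;   /* packet count reported in the stats dump */
    bool in_hw;         /* was the in_hw flag set in the dump?     */
};

/* Flag flows that see a lot of traffic but were never offloaded. */
static void report_sw_only_flows(const struct flow_entry *flows, size_t n,
                                 uint64_t pkt_threshold)
{
    for (size_t i = 0; i < n; i++) {
        if (!flows[i].in_hw && flows[i].packets > pkt_threshold)
            printf("flow %s: %llu packets handled in SW only\n",
                   flows[i].name,
                   (unsigned long long)flows[i].packets);
    }
}

int main(void)
{
    /* Illustrative data only. */
    struct flow_entry flows[] = {
        { "tcp dst 80", 1200000, true  },
        { "udp dst 53", 9800000, false },  /* hot but not in HW */
        { "icmp",            42, false },
    };

    report_sw_only_flows(flows, sizeof(flows) / sizeof(flows[0]), 100000);
    return 0;
}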

I guess the aim is to find misbehaving hardware, i.e. flows that have
the in_hw flag set but whose traffic still comes to the kernel.

For misbehaving hardware in_hw will indeed not work. Whether we need
these extra always-on stats for such a use case could be debated :)

I'm slightly concerned about the potential performance impact; would
you be able to share some numbers for a non-trivial number of flows
(100k active?)?

Agreed, features used for diagnostics cannot have a harmful penalty
for fast path performance.

Fast path performance is not affected, as these counters are not
incremented there. They are only incremented by the NIC driver when it
gathers its statistics from hardware.
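
To make that split concrete, here is a minimal sketch of the idea. The
struct and function names below are invented for illustration and are
not the code from the patch:

#include <stdint.h>
#include <stdio.h>

struct action_stats {
    uint64_t sw_bytes, sw_packets;  /* bumped on the SW fast path       */
    uint64_t hw_bytes, hw_packets;  /* bumped only when HW stats arrive */
};

/* Fast path: unchanged, never touches the HW counters. */
static void stats_update_sw(struct action_stats *s, uint64_t bytes,
                            uint64_t packets)
{
    s->sw_bytes   += bytes;
    s->sw_packets += packets;
}

/*
 * Driver side: called when the counters are pulled out of the NIC,
 * typically once per polling interval, not once per packet.
 */
static void stats_update_hw(struct action_stats *s, uint64_t bytes,
                            uint64_t packets)
{
    s->hw_bytes   += bytes;
    s->hw_packets += packets;
}

int main(void)
{
    struct action_stats s = { 0 };

    stats_update_sw(&s, 1500, 1);        /* per packet, as today        */
    stats_update_hw(&s, 3000000, 2000);  /* per poll, from the HW query */

    printf("sw: %llu pkts, hw: %llu pkts\n",
           (unsigned long long)s.sw_packets,
           (unsigned long long)s.hw_packets);
    return 0;
}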

Not by much, though you are adding state to performance-critical
structures for what are effectively debugging purposes.

I was mostly talking about the HW offload stat updates (sorry for not
being clear).

We can have some hundreds of thousands of active offloaded flows, each
of them can have multiple actions, and stats have to be updated
multiple times per second and dumped probably around once a second,
too. On a busy system the stats will get evicted from cache between
each round.
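
For a rough sense of the scale involved, here is a back-of-the-envelope
sketch; every number in it is an illustrative assumption, not a
measurement from this patch or from any particular setup:

#include <stdio.h>

int main(void)
{
    const double flows        = 300000;  /* active offloaded flows        */
    const double bytes_per_fl = 64;      /* stats state touched per flow  */
    const double updates_s    = 2;       /* HW stats refreshes per second */
    const double dumps_s      = 1;       /* userspace dumps per second    */

    double working_set = flows * bytes_per_fl;
    double touches_s   = flows * (updates_s + dumps_s);

    printf("stats working set: ~%.1f MB\n", working_set / (1024 * 1024));
    printf("counter touches:   ~%.1f M/s\n", touches_s / 1e6);

    /*
     * ~18 MB walked a few times per second is well past typical L1/L2
     * sizes, so on a busy box those cache lines will be cold each round.
     */
    return 0;
}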

But I'm speculating; let's see if I can get some numbers on it (if you
could get some too, that would be great!).

I’ll try to measure some of this later this week/early next week.

I asked Louis to run some tests while I'm travelling, and he reports
that my worry about reporting the extra stats was unfounded.  The
update function does not show up in traces at all.  It seems that
under stress (generated with stress-ng) the thread dumping the stats
in userspace (in OvS it would be the revalidator) actually consumes
less CPU in __gnet_stats_copy_basic (0.4% less for ~2.0% total).

Does this match your results? I'm not sure why dumping would be faster
with your change...

Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC
rules installed in HW.

For __gnet_stats_copy_basic() being faster I have (had) a theory: this
function is now called twice, and I assumed the first call would pull
the memory into the cache, making the second call faster.

Sampling a lot of perf data, I get an average of 1115ns with the base
kernel and 954ns with the fix applied, so about 14% faster.

I also thought I would perf tcf_action_copy_stats(), as it is the
place where the additional counters are updated. But even in this
case, I see better performance with the patch applied.

On average 13581ns with the fix vs. the base kernel at 13910ns, so
about 2.3% faster.

I guess the changes to the tc_action structure resulted in better
cache alignment.

Interesting you could reproduce the speed up too!  +1 for the guess.
Seems like my caution about slowing down SW paths to support HW offload
landed on a very unfortunate patch :)

Is there anything else blocking this from getting into net-next?

I still think this patch is beneficial for the overall user experience, and I’ve had requests for it from QA and others.
