On 17 Jan 2012, at 17:41, Коньков Евгений wrote:
> Loads only netisr3.
> and a question: IP works over ethernet. How can you distinguish IP and ether???
netstat -Q shows per-protocol (per-layer) processing statistics. An IP
packet arriving via ethernet will typically be counted twice: once for ethernet
input/decapsulation, and once for IP-layer processing. Netisr dispatch serves a
number of purposes, not least preventing excessive stack depth/recursion and
enabling load balancing.
There has been a historic tension between deferred (queued) dispatch to a
separate worker and direct dispatch ("process to completion"). The former
offers more opportunities for parallelism and reduces time spent in
interrupt-layer processing. The latter, however, reduces per-packet overhead
and overall packet latency by avoiding queueing/scheduling costs, and avoids
migrating packets between CPU caches, reducing cache coherency traffic. Our
general experience is that many common configurations, especially lower-end
systems *and* systems with multi-queue 10gbps cards, prefer direct dispatch.
However, there are forwarding scenarios, or ones in which CPU count
significantly outnumbers NIC input queue count, where queueing to additional
workers can markedly improve performance.
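To make the tradeoff concrete, here is a schematic sketch in C (my own
illustration, not the kernel code -- all names are invented): direct dispatch
runs the protocol handler synchronously in the input path, while deferred
dispatch enqueues the packet for a separate netisr worker thread.

```c
#include <stdbool.h>

static int handled_inline;	/* packets processed to completion */
static int handled_deferred;	/* packets queued for a worker thread */

static void proto_input(int pkt) { (void)pkt; handled_inline++; }
static void enqueue_for_worker(int pkt) { (void)pkt; handled_deferred++; }

static void
netisr_dispatch_sketch(int pkt, bool direct)
{
	if (direct)
		proto_input(pkt);	 /* no handoff: low overhead, cache-warm */
	else
		enqueue_for_worker(pkt); /* pay queue/schedule cost, gain parallelism */
}
```

The real decision is of course per-protocol and tunable, but the shape of the
choice -- run now in this context, or hand off and pay for the handoff -- is
exactly this.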
In FreeBSD 9.0 we've attempted to improve the vocabulary of expressible
policies in netisr so that we can explore which work best in various scenarios,
giving users more flexibility but also attempting to determine a better
longer-term model. Ideally, as with the VM system, these features would be to
some extent self-tuning, but we don't have enough information and experience to
decide how best to do that yet.
> NETISR_POLICY_FLOW netisr should maintain flow ordering as defined by
> the mbuf header flow ID field. If the protocol
> implements nh_m2flow, then netisr will query the
> protocol in the event that the mbuf doesn't have a
> flow ID, falling back on source ordering.
>
> NETISR_POLICY_CPU netisr will entirely delegate all work placement
> decisions to the protocol, querying nh_m2cpuid for
> each packet.
>
> _FLOW: the description says that the cpuid is discovered from the flow.
> _CPU: here the decision to choose a CPU is delegated to the protocol. Maybe
> it would be clearer to name it NETISR_POLICY_PROTO ???
The name has to do with the nature of the information returned by the netisr
protocol handler -- in the former case, the protocol returns a flow identifier,
which is used by netisr to calculate an affinity. In the latter case, the
protocol returns a CPU affinity directly.
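A hypothetical sketch of that distinction (the helper names below are mine,
not the actual KPI): under NETISR_POLICY_FLOW the protocol hands back only a
flow identifier and netisr itself maps it to a CPU, whereas under
NETISR_POLICY_CPU the protocol's nh_m2cpuid-style handler returns a CPU id
directly.

```c
#include <stdint.h>

#define NWORKERS 4

/* POLICY_FLOW: netisr owns the flow-ID-to-CPU mapping. */
static uint32_t
netisr_cpuid_from_flowid(uint32_t flowid)
{
	return (flowid % NWORKERS);
}

/* POLICY_CPU: a toy protocol that chooses the CPU itself -- here it
 * pins all of its traffic to CPU 0, for better or worse. */
static uint32_t
toy_proto_m2cpuid(uint32_t flowid)
{
	(void)flowid;
	return (0);
}
```

In both cases the mapping is deterministic per flow, which is what preserves
per-flow ordering.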
> and a BIG QUESTION: why do you allow somebody (flow, proto) to make any
> decisions??? That is wrong: a bad implementation/decision
> may cause packets to be scheduled only to some CPUs.
> So one CPU will be overloaded (0% idle) while others are free (100% idle).
I think you're confusing policy and mechanism. The above KPIs are about
providing the mechanism to implement a variety of policies. Many of the
policies we are interested in are not yet implemented, or available only as
patches. Keep in mind that workloads and systems are highly variable, with
variable costs for work dispatch, etc. We run on high-end Intel servers, where
individual CPUs tend to be very powerful but not all that plentiful, but also
on embedded multi-threaded MIPS devices with many hardware threads, each
individually quite weak. Deferred dispatch is a better choice for the latter,
where there are optimised handoff primitives to help avoid queueing overhead,
whereas in the former case you really want NIC-backed work dispatch, which
generally means direct dispatch with multiple ithreads (one per queue) rather
than multiple netisr threads. Using deferred dispatch in Intel-style
environments is generally unproductive, since high-end configurations already
support multi-queue input, and the CPUs are quite powerful.
>> * Enforcing ordering limits the opportunity for concurrency, but maintains
>> * the strong ordering requirements found in some protocols, such as TCP.
> TCP does not require strong ordering requirements!!! Maybe you mean UDP?
I think most people would disagree with this. Reordering TCP segments leads to
extremely poor TCP behaviour -- there is an extensive research literature on
this, and maintaining ordering for TCP flows is a critical network stack design
goal.
> To get full concurrency you must put a new flowid on a free CPU and
> remember the cpuid for that flow.
Stateful assignment of flows to CPUs is of significant interest to us,
although currently we only support hash-based assignment without state. In
large part that is a good decision, as multi-queue network cards vary widely
in the size of the state tables they provide for offloading flow-specific
affinity policies. For example, lower-end 10gbps cards may support state
tables with 32 entries, while high-end cards may support state tables with
tens of thousands of entries.
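To sketch why table size matters (hypothetical code, not anything shipped): a
stateful scheme keeps an explicit flow-to-CPU entry and falls back to a
stateless hash when the small table cannot answer, and with only 32 entries
the fallback dominates under any real flow count.

```c
#include <stdint.h>

#define NCPUS            4
#define STATE_TABLE_SIZE 32	/* e.g. a low-end 10gbps NIC */

struct flow_entry {
	uint32_t	flowid;
	uint32_t	cpu;
	int		valid;
};

static struct flow_entry state_table[STATE_TABLE_SIZE];

/* Record an explicit placement decision for one flow. */
static void
pin_flow(uint32_t flowid, uint32_t cpu)
{
	uint32_t slot = flowid % STATE_TABLE_SIZE;

	state_table[slot].flowid = flowid;
	state_table[slot].cpu = cpu;
	state_table[slot].valid = 1;
}

/* Stateful lookup with a stateless hash fallback. */
static uint32_t
cpu_for_flow(uint32_t flowid)
{
	uint32_t slot = flowid % STATE_TABLE_SIZE;

	if (state_table[slot].valid && state_table[slot].flowid == flowid)
		return (state_table[slot].cpu);	/* stateful hit */
	return (flowid % NCPUS);		/* stateless fallback */
}
```

The stateless fallback still preserves per-flow ordering, since the same flow
always hashes the same way; what it cannot do is make load-aware exceptions.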
> Just hash the packet flow to the number of threads: net.isr.numthreads
> nws_array[flowid] = hash( flowid, sourceid, ifp->if_index, source )
> if( cpuload( nws_array[flowid] ) > 99 )
>     nws_array[flowid]++; // queue packet to other CPU
>
> that would be just ten lines of code instead of 50 in your case.
We support a more complex KPI because we need to support future policies that
are themselves more complex. For example, there are out-of-tree changes that
align TCP-level and netisr-level per-CPU data structures and affinity with NIC
RSS support. The algorithm you've suggested above explicitly introduces
reordering, which would significantly damage network performance, even though
it appears to balance CPU load better.
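A toy deterministic simulation (my own, not FreeBSD code) shows the problem
with the load-based bump: packets 1, 2, 3 of one flow normally land on queue
A, but packet 2 hits the "CPU loaded" check and is pushed to queue B. Nothing
orders B's worker after A's, so if B runs first the stack sees 2 before 1 --
the flow has been reordered.

```c
#define QLEN 8

static int qa[QLEN], qb[QLEN], out[QLEN];
static int na, nb, nout;

static void enqueue(int *q, int *n, int seq) { q[(*n)++] = seq; }

/* Deliver everything on one queue, in queue order. */
static void
drain(int *q, int *n)
{
	int i;

	for (i = 0; i < *n; i++)
		out[nout++] = q[i];
	*n = 0;
}

/* Returns the sequence number delivered first in this scenario. */
static int
first_delivered_seq(void)
{
	na = nb = nout = 0;
	enqueue(qa, &na, 1);
	enqueue(qb, &nb, 2);	/* "overloaded" check bumped packet 2 to B */
	enqueue(qa, &na, 3);
	drain(qb, &nb);		/* B's worker happens to run first */
	drain(qa, &na);
	return (out[0]);	/* 2 arrives before 1 */
}
```

For TCP, that single transposition is enough to trigger duplicate ACKs and,
with enough of them, a spurious fast retransmit.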
> Also notice you have:
> /*
> * Utility routines for protocols that implement their own mapping of flows
> * to CPUs.
> */
> u_int
> netisr_get_cpucount(void)
> {
>
> return (nws_count);
> }
>
> but you do not use it! that breaks encapsulation.
This is a public symbol for use outside of the netisr framework -- for example,
in the uncommitted RSS code.
> Also I want to ask you: help me please, where can I find documentation
> about netisr scheduling and the full packet flow through the kernel:
> packetinput->kernel->packetoutput
> but with more description of what is going on with a packet while it is
> passing through a router.
Unfortunately, this code is currently largely self-documenting. The Stevens
books are getting quite outdated, as is McKusick/Neville-Neil; however, they
at least offer structural guides which may be of use to you. Refreshes of
these books would be extremely helpful.
Robert

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "[email protected]"