Re: netisr observations

2014-04-11 Thread Adrian Chadd
[snip]

So, hm, the thing that comes to mind is the flowid. What are the various
flowids for the flows? Are they all mapping to CPU 3 somehow?

-a
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


tunnels, mtu and payload length

2014-04-11 Thread Eugene M. Zheganin
Hi.

Can someone explain to me where the 4 missing bytes are when capturing
traffic on a gif interface with tcpdump?
I expect the length of the first fragment (offset = 0) to be equal to
the MTU (1280 bytes), but it's clearly 1276 bytes.
Same thing happens to a gre tunnel.

# ifconfig gif0
gif0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1280
tunnel inet 192.168.3.24 --> 192.168.3.17
inet 172.16.5.40 --> 172.16.5.41 netmask 0x
inet6 fe80::21a:64ff:fe21:8e80%gif0 prefixlen 64 scopeid 0x1c
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

# ping -s 4096 172.16.5.41
PING 172.16.5.41 (172.16.5.41): 4096 data bytes
4104 bytes from 172.16.5.41: icmp_seq=0 ttl=64 time=0.837 ms
4104 bytes from 172.16.5.41: icmp_seq=1 ttl=64 time=0.870 ms
4104 bytes from 172.16.5.41: icmp_seq=2 ttl=64 time=0.779 ms
4104 bytes from 172.16.5.41: icmp_seq=3 ttl=64 time=0.823 ms
4104 bytes from 172.16.5.41: icmp_seq=4 ttl=64 time=0.794 ms

tcpdump:

12:58:33.430450 IP (tos 0x0, ttl 64, id 40760, offset 0, flags [+],
proto ICMP (1), length 1276)
172.16.5.40 > 172.16.5.41: ICMP echo request, id 62980, seq 17,
length 1256
12:58:33.430467 IP (tos 0x0, ttl 64, id 40760, offset 1256, flags [+],
proto ICMP (1), length 1276)
172.16.5.40 > 172.16.5.41: ip-proto-1
12:58:33.430481 IP (tos 0x0, ttl 64, id 40760, offset 2512, flags [+],
proto ICMP (1), length 1276)
172.16.5.40 > 172.16.5.41: ip-proto-1
12:58:33.430494 IP (tos 0x0, ttl 64, id 40760, offset 3768, flags
[none], proto ICMP (1), length 356)
172.16.5.40 > 172.16.5.41: ip-proto-1

Thanks.
Eugene.
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


dummynet/ipfw high load?

2014-04-11 Thread Dennis Yusupoff
Good day, gurus!

We have servers running FreeBSD. They do NAT, shaping and traffic
accounting for our (mainly) home customers.
NAT is done with pf nat, shaping with ipfw dummynet, and traffic
accounting with ng_netflow fed via ipfw ng_tee.
The problem is performance under (relatively) high traffic.
On a Xeon E3-1270, whether we use an Intel 10Gbit/sec 82599-based NIC (ix)
or an Intel I350 (82579) in lagg, transit traffic of 800 Mbit/sec and
100 kpps [to customers] causes CPU load of almost 100% from NIC
interrupts, or the same with net.isr.dispatch=deferred and
net.inet.ip.fastforwarding=0.
Deleting the ipfw pipe decreases the load by ~30% per CPU.
Deleting the ipfw ng_tee (to ng_netflow) decreases it by ~15% per CPU.
Turning off ipfw entirely (sysctl net.inet.ip.fw.enable=0) decreases the
load even more, so that the server can pass (NATed!) traffic at
1600 Mbit/sec and 200 kpps with a load of only ~40% per CPU.
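
For reference, a minimal sketch of that kind of ruleset (the rule numbers,
pipe bandwidth, netgraph cookie and interface name below are made up for
illustration, not taken from our real config):

# per-customer shaping with a dummynet pipe
ipfw pipe 10 config bw 10Mbit/s mask dst-ip 0xffffffff
# copy packets to ng_netflow through ng_ipfw, cookie 9995
ipfw add 100 ngtee 9995 ip from any to any via lagg0
ipfw add 200 pipe 10 ip from any to any out via lagg0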

So my questions are:
1. Is there any way to decrease the system load caused by dummynet/ipfw?
2. Why does dummynet/ipfw increase the *interrupt* load, rather than
kernel time or something like that?
3. Is there any way to profile this kind of load? The existing DTrace
and pmcstat examples are almost useless, or I just don't know how to use
them properly (see the sketch after this list).
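
As a starting point, a hedged sketch of kernel CPU profiling with the stock
tools (the event name, sampling interval and output paths are only examples):

# live, top-like view of kernel hotspots with hwpmc
kldload hwpmc
pmcstat -TS instructions -w 1

# or record samples for 30 seconds and build a callgraph
pmcstat -S instructions -O /tmp/samples.out sleep 30
pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt

# DTrace alternative: sample on-CPU kernel stacks at ~997 Hz
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'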

Because of the huge amount of debugging info (including dtrace and pmcstat
samples), sysctl settings and so on, I opened a topic at a Russian network
operators' forum: http://forum.nag.ru/forum/index.php?showtopic=93674
In English it's available via Google Translate:
http://translate.google.com/translate?hl=en&sl=auto&tl=en&u=http%3A%2F%2Fforum.nag.ru%2Fforum%2Findex.php%3Fshowtopic%3D93674

Feel free to ask me any questions and to run things on the server!

I would VERY much appreciate any help and can take any measurements and
do any debugging on that server. Moreover, I'm ready to give root access
to any appropriate person (as I already did for Gleb Smirnoff when we
were investigating a pf state problem).


-- 
Best regards,
Dennis Yusupoff,
network engineer of
Smart-Telecom ISP
Russia, Saint-Petersburg 

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: Patches for RFC6937 and draft-ietf-tcpm-newcwv-00

2014-04-11 Thread Eggert, Lars
Hi,

since folks are playing with Midori's DCTCP patch, I wanted to make sure that 
you were also aware of the patches that Aris did for PRR and NewCWV...

Lars

On 2014-2-4, at 10:38, Eggert, Lars l...@netapp.com wrote:

 Hi,
 
 below are two patches that implement RFC6937 (Proportional Rate Reduction 
 for TCP) and draft-ietf-tcpm-newcwv-00 (Updating TCP to support 
 Rate-Limited Traffic). They were done by Aris Angelogiannopoulos for his MS 
 thesis, which is at https://eggert.org/students/angelogiannopoulos-thesis.pdf.
 
 The patches should apply to -CURRENT as of Sep 17, 2013. (Sorry for the delay 
 in sending them, we'd been trying to get some feedback from committers first, 
 without luck.)
 
 Please note that newcwv is still a work in progress in the IETF, and the 
 patch has some limitations with regards to the pipeACK Sampling Period 
 mentioned in the Internet-Draft. Aris says this in his thesis about what 
 exactly he implemented:
 
 The second implementation choice concerns the measurement of
 pipeACK. This variable is the most important one introduced by the method and
 is used to determine the phase that the sender is currently in. In order to
 compute pipeACK, the approach suggested by the Internet-Draft (ID) is followed
 [ncwv]. During initialization, pipeACK is set to the maximum possible value.
 A helper variable prevHighACK is introduced that is initialized to the
 initial sequence number (iss). prevHighACK holds the value of the highest
 acknowledged byte so far. pipeACK is measured once per RTT, meaning that when
 an ACK covering prevHighACK is received, pipeACK becomes the difference
 between the current ACK and prevHighACK. This is called a pipeACK sample. A
 newer version of the draft suggests that multiple pipeACK samples can be used
 during the pipeACK sampling period.
 
 Lars
 
 prr.patch  newcwv.patch





Re: dummynet/ipfw high load?

2014-04-11 Thread Sami Halabi
Hi,
I had a similar problem in the past and it turned out to be the amount of
rules in ipfw.
Using a reduced subset of rules with tables actually reduced the load.
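
For example (the table number, prefixes and rule below are purely
illustrative), many per-prefix rules can often be collapsed into a single
table lookup:

# instead of one pipe rule per customer prefix:
ipfw table 1 add 192.0.2.0/24
ipfw table 1 add 198.51.100.0/24
ipfw add 100 pipe 10 ip from any to table(1) out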

Sami

On Friday, April 11, 2014, Dennis Yusupoff d...@smartspb.net wrote:

 Good day, gurus!

 We have servers running FreeBSD. They do NAT, shaping and traffic
 accounting for our (mainly) home customers.
 NAT is done with pf nat, shaping with ipfw dummynet, and traffic
 accounting with ng_netflow fed via ipfw ng_tee.
 The problem is performance under (relatively) high traffic.
 On a Xeon E3-1270, whether we use an Intel 10Gbit/sec 82599-based NIC (ix)
 or an Intel I350 (82579) in lagg, transit traffic of 800 Mbit/sec and
 100 kpps [to customers] causes CPU load of almost 100% from NIC
 interrupts, or the same with net.isr.dispatch=deferred and
 net.inet.ip.fastforwarding=0.
 Deleting the ipfw pipe decreases the load by ~30% per CPU.
 Deleting the ipfw ng_tee (to ng_netflow) decreases it by ~15% per CPU.
 Turning off ipfw entirely (sysctl net.inet.ip.fw.enable=0) decreases the
 load even more, so that the server can pass (NATed!) traffic at
 1600 Mbit/sec and 200 kpps with a load of only ~40% per CPU.

 So my questions are:
 1. Is there any way to decrease the system load caused by dummynet/ipfw?
 2. Why does dummynet/ipfw increase the *interrupt* load, rather than
 kernel time or something like that?
 3. Is there any way to profile this kind of load? The existing DTrace
 and pmcstat examples are almost useless, or I just don't know how to use
 them properly.

 Because of the huge amount of debugging info (including dtrace and pmcstat
 samples), sysctl settings and so on, I opened a topic at a Russian network
 operators' forum: http://forum.nag.ru/forum/index.php?showtopic=93674
 In English it's available via Google Translate:

 http://translate.google.com/translate?hl=en&sl=auto&tl=en&u=http%3A%2F%2Fforum.nag.ru%2Fforum%2Findex.php%3Fshowtopic%3D93674

 Feel free to ask me any questions and to run things on the server!

 I would VERY much appreciate any help and can take any measurements and
 do any debugging on that server. Moreover, I'm ready to give root access
 to any appropriate person (as I already did for Gleb Smirnoff when we
 were investigating a pf state problem).


 --
 Best regards,
 Dennis Yusupoff,
 network engineer of
 Smart-Telecom ISP
 Russia, Saint-Petersburg

 ___
 freebsd-net@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-net
 To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
 



-- 
Sami Halabi
Information Systems Engineer
NMS Projects Expert
FreeBSD SysAdmin Expert
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org

Re: Patches for RFC6937 and draft-ietf-tcpm-newcwv-00

2014-04-11 Thread hiren panchasara
On Fri, Apr 11, 2014 at 4:15 AM, Eggert, Lars l...@netapp.com wrote:
 Hi,

 since folks are playing with Midori's DCTCP patch, I wanted to make sure that 
 you were also aware of the patches that Aris did for PRR and NewCWV...


 prr.patch  newcwv.patch

Lars,

There are no actual patches attached here. (Or the mailing-list dropped them.)

cheers,
Hiren
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: netisr observations

2014-04-11 Thread Patrick Kelsey
On Fri, Apr 11, 2014 at 2:48 AM, Adrian Chadd adr...@freebsd.org wrote:

 [snip]

 So, hm, the thing that comes to mind is the flowid. What are the various
 flowids for the flows? Are they all mapping to CPU 3 somehow?


The output of netstat -Q shows IP dispatch is set to default, which is
direct (NETISR_DISPATCH_DIRECT).  That means each IP packet will be
processed on the same CPU that the Ethernet processing for that packet was
performed on, so CPU selection for IP packets will not be based on flowid.
The output of netstat -Q shows Ethernet dispatch is set to direct
(NETISR_DISPATCH_DIRECT if you wind up reading the code), so the Ethernet
processing for each packet will take place on the same CPU that the driver
receives that packet on.
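
For reference, the dispatch settings and per-CPU workstream counters being
described here come straight from the stock tools:

# per-protocol dispatch policy, queue limits and per-CPU counters
netstat -Q
# all netisr tunables, including the global net.isr.dispatch default
sysctl net.isr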

For the igb driver with queues autoconfigured and msix enabled, as the
sysctl output shows you have, the driver will create a number of queues
subject to device limitations, msix message limitations, and the number of
CPUs in the system, establish a separate interrupt handler for each one,
and bind each of those interrupt handlers to a separate CPU.  It also
creates a separate single-threaded taskqueue for each queue.  Each queue
interrupt handler sends work to its associated taskqueue when the interrupt
fires.  Those taskqueues are where the Ethernet packets are received and
processed by the driver.  The question is where those taskqueue threads
will be run.  I don't see anything in the driver that makes an attempt to
bind those taskqueue threads to specific CPUs, so really the location of
all of the packet processing is up to the scheduler (i.e., arbitrary).
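
A quick way to see those per-queue interrupt vectors and their CPU binding
from userland (the irq number below is only an example of what vmstat -i
might print):

# each igb queue has its own MSI-X interrupt source, e.g. "irq256: igb0:que 0"
vmstat -i | grep igb
# show which CPU(s) a given interrupt vector is allowed to run on
cpuset -g -x 256
# how many queues the driver was allowed to create (boot-time tunable)
sysctl hw.igb.num_queues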

The summary is:

1. the hardware schedules each received packet to one of its queues and
raises the interrupt for that queue
2. that queue interrupt is serviced on the same CPU all the time, which is
different from the CPUs for all other queues on that interface
3. the interrupt handler notifies the corresponding task queue, which runs
its task in a thread on whatever CPU the scheduler chooses
4. that task dispatches the packet for Ethernet processing via netisr,
which processes it on whatever the current CPU is
5. Ethernet processing dispatches that packet for IP processing via netisr,
which processes it on whatever the current CPU is

You might want to try changing the default netisr dispatch policy to
'deferred' (sysctl net.isr.dispatch=deferred).  If you do that, the
Ethernet processing will still happen on an arbitrary CPU chosen by the
scheduler, but the IP processing should then get mapped to a CPU based on
the flowid assigned by the driver.  Since igb assigns flowids based on
received queue number, all IP (and above) processing for that packet should
then be performed on the same CPU the queue interrupt was bound to.
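
Concretely, that experiment would look something like the following (this is
the standard netisr knob; add the same line to /etc/sysctl.conf to keep it
across reboots):

# switch the global netisr dispatch policy at runtime
sysctl net.isr.dispatch=deferred
# confirm it, then watch how the per-CPU counters shift
sysctl net.isr.dispatch
netstat -Q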

Unfortunately, I don't have a system with igb interfaces to try that on.

-Patrick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: Preventing ng_callout() timeouts to trigger packet queuing

2014-04-11 Thread Julian Elischer

disclaimer: I'm not looking at the code now.. I want to go to bed:  :-)

When I wrote that code, the idea was that even a direct node execution
should become a queuing operation if there was already something else
on the queue. So in that model packets were not supposed to get
re-ordered. Does that not still work?


Either that, or you need to explain the problem to me a bit better..


On 4/10/14, 5:25 AM, Karim Fodil-Lemelin wrote:

Hi,

Below is a revised patch for this issue. It keeps the queuing (writer)
behaviour for nodes or hooks that are explicitly marked as requiring it:


@@ -3632,7 +3632,12 @@ ng_callout(struct callout *c, node_p node, hook_p hook, int ticks,

if ((item = ng_alloc_item(NGQF_FN, NG_NOFLAGS)) == NULL)
return (ENOMEM);

-   item->el_flags |= NGQF_WRITER;
+   if ((node->nd_flags & NGF_FORCE_WRITER) ||
+   (hook && (hook->hk_flags & HK_FORCE_WRITER)))
+ item->el_flags |= NGQF_WRITER;
+   else
+ item->el_flags |= NGQF_READER;
+
NG_NODE_REF(node);  /* and one for the item */
NGI_SET_NODE(item, node);
if (hook) {

Regards,

Karim.

On 09/04/2014 3:16 PM, Karim Fodil-Lemelin wrote:

Hi List,

I'm calling out to the general wisdom... I have seen an issue in
netgraph where a callout routine registered with ng_callout(), when it
fires, will trigger packet queuing inside the netgraph worklist, because
ng_callout() suddenly makes my node a WRITER node (therefore
non-reentrant) for the duration of the call.

So as soon as the callout function returns, all subsequent packets get
passed directly to the node again, and only when the ngintr thread
eventually runs do I get the queued packets. This introduces out-of-order
packets in the flow. I am using the patch below to solve the issue and am
wondering if there is anything wrong with it (and maybe contribute it
back :):



@@ -3632,7 +3632,7 @@ ng_callout(struct callout *c, node_p node, hook_p hook, int ticks,

if ((item = ng_alloc_item(NGQF_FN, NG_NOFLAGS)) == NULL)
return (ENOMEM);

-   item->el_flags |= NGQF_WRITER;
+   item->el_flags = NGQF_READER;
NG_NODE_REF(node);  /* and one for the item */
NGI_SET_NODE(item, node);
if (hook) {


Best regards,

Karim.
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org



___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: Preventing ng_callout() timeouts to trigger packet queuing

2014-04-11 Thread Adrian Chadd
Well, ethernet drivers nowadays seem to be doing:

* always queue
* then pop the head item off the queue and transmit that.


-a


On 11 April 2014 11:59, Julian Elischer jul...@freebsd.org wrote:
 disclaimer: I'm not looking at the code now.. I want to go to bed:  :-)

 When I wrote that code, the idea was that even a direct node execution
 should become a queuing operation if there was already something else on the
 queue. So in that model packets were not supposed to get re-ordered. Does
 that not still work?

 Either that, or you need to explain the problem to me a bit better..



 On 4/10/14, 5:25 AM, Karim Fodil-Lemelin wrote:

 Hi,

 Below is a revised patch for this issue. It keeps the queuing (writer)
 behaviour for nodes or hooks that are explicitly marked as requiring it:

 @@ -3632,7 +3632,12 @@ ng_callout(struct callout *c, node_p node, hook_p hook, int ticks,
 if ((item = ng_alloc_item(NGQF_FN, NG_NOFLAGS)) == NULL)
 return (ENOMEM);

 -   item->el_flags |= NGQF_WRITER;
 +   if ((node->nd_flags & NGF_FORCE_WRITER) ||
 +   (hook && (hook->hk_flags & HK_FORCE_WRITER)))
 + item->el_flags |= NGQF_WRITER;
 +   else
 + item->el_flags |= NGQF_READER;
 +
 NG_NODE_REF(node);  /* and one for the item */
 NGI_SET_NODE(item, node);
 if (hook) {

 Regards,

 Karim.

 On 09/04/2014 3:16 PM, Karim Fodil-Lemelin wrote:

 Hi List,

 I'm calling out to the general wisdom... I have seen an issue in
 netgraph where a callout routine registered with ng_callout(), when it
 fires, will trigger packet queuing inside the netgraph worklist, because
 ng_callout() suddenly makes my node a WRITER node (therefore non-reentrant)
 for the duration of the call.

 So as soon as the callout function returns, all subsequent packets get
 passed directly to the node again, and only when the ngintr thread
 eventually runs do I get the queued packets. This introduces out-of-order
 packets in the flow. I am using the patch below to solve the issue and am
 wondering if there is anything wrong with it (and maybe contribute it
 back :):


 @@ -3632,7 +3632,7 @@ ng_callout(struct callout *c, node_p node, hook_p hook, int ticks,
 if ((item = ng_alloc_item(NGQF_FN, NG_NOFLAGS)) == NULL)
 return (ENOMEM);

 -   item->el_flags |= NGQF_WRITER;
 +   item->el_flags = NGQF_READER;
 NG_NODE_REF(node);  /* and one for the item */
 NGI_SET_NODE(item, node);
 if (hook) {


 Best regards,

 Karim.
 ___
 freebsd-net@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-net
 To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


 ___
 freebsd-net@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-net
 To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


 ___
 freebsd-net@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-net
 To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: Patches for RFC6937 and draft-ietf-tcpm-newcwv-00

2014-04-11 Thread hiren panchasara
On Fri, Apr 11, 2014 at 4:16 PM, hiren panchasara
hiren.panchas...@gmail.com wrote:
 On Fri, Apr 11, 2014 at 4:15 AM, Eggert, Lars l...@netapp.com wrote:
 Hi,

 since folks are playing with Midori's DCTCP patch, I wanted to make sure 
 that you were also aware of the patches that Aris did for PRR and NewCWV...


 prr.patch  newcwv.patch

 Lars,

 There are no actual patches attached here. (Or the mailing-list dropped them.)

Ah, my bad. I think you are referring to the patches in the original
email. I can see them.

cheers,
Hiren
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: netisr observations

2014-04-11 Thread hiren panchasara
On Fri, Apr 11, 2014 at 11:30 AM, Patrick Kelsey kel...@ieee.org wrote:

 The output of netstat -Q shows IP dispatch is set to default, which is
 direct (NETISR_DISPATCH_DIRECT).  That means each IP packet will be
 processed on the same CPU that the Ethernet processing for that packet was
 performed on, so CPU selection for IP packets will not be based on flowid.
 The output of netstat -Q shows Ethernet dispatch is set to direct
 (NETISR_DISPATCH_DIRECT if you wind up reading the code), so the Ethernet
 processing for each packet will take place on the same CPU that the driver
 receives that packet on.

 For the igb driver with queues autoconfigured and msix enabled, as the
 sysctl output shows you have, the driver will create a number of queues
 subject to device limitations, msix message limitations, and the number of
 CPUs in the system, establish a separate interrupt handler for each one, and
 bind each of those interrupt handlers to a separate CPU.  It also creates a
 separate single-threaded taskqueue for each queue.  Each queue interrupt
 handler sends work to its associated taskqueue when the interrupt fires.
 Those taskqueues are where the Ethernet packets are received and processed
 by the driver.  The question is where those taskqueue threads will be run.
 I don't see anything in the driver that makes an attempt to bind those
 taskqueue threads to specific CPUs, so really the location of all of the
 packet processing is up to the scheduler (i.e., arbitrary).

 The summary is:

 1. the hardware schedules each received packet to one of its queues and
 raises the interrupt for that queue
 2. that queue interrupt is serviced on the same CPU all the time, which is
 different from the CPUs for all other queues on that interface
 3. the interrupt handler notifies the corresponding task queue, which runs
 its task in a thread on whatever CPU the scheduler chooses
 4. that task dispatches the packet for Ethernet processing via netisr, which
 processes it on whatever the current CPU is
 5. Ethernet processing dispatches that packet for IP processing via netisr,
 which processes it on whatever the current CPU is

I really appreciate you taking time and explaining this. Thank you.

I am especially confused by the ip "Queued" column from netstat -Q
showing 203888563 only for cpu3. Does this mean that cpu3 queues
everything and then distributes it among the other cpus? In which of the
5 stages you mentioned above does this queuing on cpu3 happen?

This value gets populated in the snwp->snw_queued field for each cpu
inside sysctl_netisr_work().


 You might want to try changing the default netisr dispatch policy to
 'deferred' (sysctl net.isr.dispatch=deferred).  If you do that, the Ethernet
 processing will still happen on an arbitrary CPU chosen by the scheduler,
 but the IP processing should then get mapped to a CPU based on the flowid
 assigned by the driver.  Since igb assigns flowids based on received queue
 number, all IP (and above) processing for that packet should then be
 performed on the same CPU the queue interrupt was bound to.

I will give this a try and see how things behave.

I was also thinking about net.isr.bindthreads. netisr_start_swi() does
intr_event_bind() if we have bindthreads set to 1. What would that
gain me, if anything?

Would it stop intr{swi1: netisr 3} from moving to different CPUs (as I
am seeing in 'top' output) and bind it to a single CPU?

I've come across a thread discussing some side-effects of this though:
http://lists.freebsd.org/pipermail/freebsd-hackers/2012-January/037597.html
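
(For reference, both of those are boot-time tunables, so an experiment would
look roughly like the following; the values are only an example.)

# /boot/loader.conf
net.isr.bindthreads=1   # bind each netisr thread to its CPU at boot
net.isr.maxthreads=4    # e.g. one netisr thread per core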

Thanks a ton, again.

cheers,
Hiren
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: netisr observations

2014-04-11 Thread Patrick Kelsey
On Fri, Apr 11, 2014 at 8:23 PM, hiren panchasara 
hiren.panchas...@gmail.com wrote:

 On Fri, Apr 11, 2014 at 11:30 AM, Patrick Kelsey kel...@ieee.org wrote:
 
  The output of netstat -Q shows IP dispatch is set to default, which is
  direct (NETISR_DISPATCH_DIRECT).  That means each IP packet will be
  processed on the same CPU that the Ethernet processing for that packet
 was
  performed on, so CPU selection for IP packets will not be based on
 flowid.
  The output of netstat -Q shows Ethernet dispatch is set to direct
  (NETISR_DISPATCH_DIRECT if you wind up reading the code), so the Ethernet
  processing for each packet will take place on the same CPU that the
 driver
  receives that packet on.
 
  For the igb driver with queues autoconfigured and msix enabled, as the
  sysctl output shows you have, the driver will create a number of queues
  subject to device limitations, msix message limitations, and the number
 of
  CPUs in the system, establish a separate interrupt handler for each one,
 and
  bind each of those interrupt handlers to a separate CPU.  It also
 creates a
  separate single-threaded taskqueue for each queue.  Each queue interrupt
  handler sends work to its associated taskqueue when the interrupt fires.
  Those taskqueues are where the Ethernet packets are received and
 processed
  by the driver.  The question is where those taskqueue threads will be
 run.
  I don't see anything in the driver that makes an attempt to bind those
  taskqueue threads to specific CPUs, so really the location of all of the
  packet processing is up to the scheduler (i.e., arbitrary).
 
  The summary is:
 
  1. the hardware schedules each received packet to one of its queues and
  raises the interrupt for that queue
  2. that queue interrupt is serviced on the same CPU all the time, which
 is
  different from the CPUs for all other queues on that interface
  3. the interrupt handler notifies the corresponding task queue, which
 runs
  its task in a thread on whatever CPU the scheduler chooses
  4. that task dispatches the packet for Ethernet processing via netisr,
 which
  processes it on whatever the current CPU is
  5. Ethernet processing dispatches that packet for IP processing via
 netisr,
  which processes it on whatever the current CPU is

 I really appreciate you taking time and explaining this. Thank you.


Sure thing.  I've had my head in the netisr code frequently lately, and
it's nice to be able to share :)



 I am especially confused by the ip "Queued" column from netstat -Q
 showing 203888563 only for cpu3. Does this mean that cpu3 queues
 everything and then distributes it among the other cpus? In which of the
 5 stages you mentioned above does this queuing on cpu3 happen?

 This value gets populated in the snwp->snw_queued field for each cpu
 inside sysctl_netisr_work().


The way your system is configured, all inbound packets are being
direct-dispatched.  Those packets will bump the dispatched and handled
counters, but not the queued counter.  The queued counter only gets bumped
when something is queued to a netisr thread.  You can figure out where that
is happening, despite everything apparently being configured for direct
dispatch, by looking at where netisr_queue() and netisr_queue_src() are
being called from.  netisr_queue() is called during ipv6 forwarding and
output and ipv4 output when the destination is a local address, gre
processing, route socket processing, if_simloop() (which is called to loop
back multicast packets, for example)...  netisr_queue_src() is called
during ipsec and divert processing.
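
If you want to confirm empirically which of those call paths is doing the
queueing on a given box, something along these lines should work (assuming
the two functions are not inlined, since fbt can only probe real function
entries):

# aggregate kernel stacks that lead into the netisr queueing functions
dtrace -n 'fbt::netisr_queue:entry, fbt::netisr_queue_src:entry { @[stack()] = count(); }'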

One thing to consider also when thinking about what the netisr per-cpu
counters represent is that netisr really maintains per-cpu workstream
context, not per-netisr-thread.  Direct-dispatched packets contribute to
the statistics of the workstream context of whichever CPU they are being
direct-dispatched on.  Packets handled by a netisr thread contribute to the
statistics of the workstream context of the CPU it was created for, whether
or not it was bound to, or is currently running on, that CPU.  So when you
look at the statistics in netstat -Q output for CPU 3, dispatched is the
number of packets direct-dispatched on CPU 3, queued is the number of
packets queued to the netisr thread associated with CPU 3 (but that may be
running all over the place if net.isr.bindthreads is 0), and handled is the
number of packets processed directly on CPU 3 or in the netisr thread
associated with CPU3.




 
  You might want to try changing the default netisr dispatch policy to
  'deferred' (sysctl net.isr.dispatch=deferred).  If you do that, the
 Ethernet
  processing will still happen on an arbitrary CPU chosen by the scheduler,
  but the IP processing should then get mapped to a CPU based on the flowid
  assigned by the driver.  Since igb assigns flowids based on received
 queue
  number, all IP (and above) processing for that packet should then be
  performed on the same CPU the queue interrupt was bound to.

 I