On Fri, 15 Dec 2006, Alan Amesbury wrote:
Bruce Evans wrote:
...
The (extremely busy) interface carries exclusively incoming traffic, received
promiscuously. Since that's provided enough clues as to what this box
might actually be doing, I'll give away the secret: It's running snort.
:-)
I don't believe in POLLING or HZ=1000, but recently tested them with
bge. ...
How are you benchmarking this?
Just by blasting packets, usually with ttcp.
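(For reference, a typical ttcp UDP blast looks roughly like the sketch
below; the flags are the classic ttcp ones, and the packet size and count
are arbitrary examples, not necessarily what was used here:)
%%%
# receiver: sink UDP packets as fast as possible
ttcp -r -u -s
# sender: blast small UDP packets at the receiver
ttcp -t -u -s -l 64 -n 1000000 receiver-host
%%%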
...
Well, I'm not exactly tied to polling. I just tried it as an
alternative and, for at least part of the time, it's performed better
than non-polling. I'm open to alternatives; I just want as close to
zero loss as possible.
Polling is not working acceptably for me at all. I'm testing on the
same network and machine that are serving nfs/udp. Apparently, with
polling there is an i/o error every few seconds even under light loads,
and of course errors are especially bad for nfs/udp (nfs seems to
recover but takes about 1 minute).
...
[snip - "I've read polling(4) and it says..."]
I can (easily) generate only 250 kpps on input and had to increase
kern.polling.burst_max to > 250 to avoid huge packet lossage at this
rate. It doesn't seem to work right for output, since I can (easily)
generate 340 kpps output and got that with a burst max of only 15,
when I should have got only 15 kpps. Output is faster at the lowest level
(but slower at higher levels), so doing larger bursts of output might
be intentional. However, output at 340 kpps gives a system load of
100% on the test machine (which is not very fast or SMP) no matter
how it is done (polling just makes it go 2% faster), so polling is not
doing its main job very well. Polling's main job is to prevent
network activity from using 100% CPU. Large values of
kern.polling.burst_max are fundamentally incompatible with polling
doing this. On my test system, a burst max of 1000 combined with HZ
= 1000 would just ask the driver alone to use 100% of the CPU doing
1000 kpps through a single device. "Fortunately", the device can't
go that fast, so plenty of CPU is left.
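(To make the arithmetic above concrete, here is a rough sketch of the
knobs involved; the sysctl names are the ones mentioned in this thread,
the per-interface enabling step is roughly what polling(4) describes for
6.x, and the exact mechanism may differ by release:)
%%%
# kernel config (rebuild required):
#   options DEVICE_POLLING
#   options HZ=1000
#
# runtime tuning:
sysctl kern.polling.burst_max=150   # max packets handled per clock-tick poll
sysctl kern.polling.idle_poll=1     # also poll from the idle loop
ifconfig bge1 polling               # per-interface enable (6.x-style)
#
# rough budget for the clock-driven polls alone:
#   HZ * burst_max = 1000 * 150  =   150,000 pps (150 kpps)
#   HZ * burst_max = 1000 * 1000 = 1,000,000 pps -- i.e. the driver is
#   being allowed to consume essentially all of the CPU by itself
%%%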
That's for sending, right? In this case that's not an issue. I simply
have incoming traffic with MTUs of up to 9216 bytes that I want to
*receive*. Never mind the fact that bge(4) and the underlying hardware
suck in that they can't do that (although there's apparently a WinDOS
driver that can do it on the same hardware?!). Again, my focus is on
sucking in packets as fast as possible with minimal loss.
Some bge hardware certainly supports jumbo frames. Half of mine can, and
the other half is documented not to.
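(For the hardware that does support it, the jumbo MTU is set in the usual
way; 9000 below is only a common example, and whether the 9216 mentioned
above is reachable depends on the chip:)
%%%
# one-off:
ifconfig bge0 mtu 9000 up
# persistent, in /etc/rc.conf:
ifconfig_bge0="up mtu 9000"
%%%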
...
If I understand you correctly, it sounds like I'd be better off without
polling, particularly if there are *any* buffer limitations in the
Broadcom hardware. Again, it's not idle; the lowest packet receive
rate I've seen lately is around 40Kpkt/sec, and the lowest ever
recorded was around 16Kpkt/sec.
No, you seem to have the fairly specialized but common application where
polling currently works better, except for the problem with packet loss
which we don't completely understand but seems to be related to thread
priorities.
* With polling on, kern.polling.burst_max=150:
- kern.polling.burst holds at 150
- 'vmstat 5' shows context switches hold around 2600, with
interrupts holding around 30K
I think you mean `systat -vmstat 5'. The interrupt count here is bogus.
No, I mean 'vmstat 5'. I just let it dump a line every five seconds and
watch what happens. Context switches and interrupts are both shown.
The 'systat' version, in this case, is harder for me to read; it also
lacks the scrolling history of 'vmstat'. Sample output taken while
writing this (note that the first line is almost always bogus and sorry
if wrap is borked):
Ah, I forgot that I fixed some interrupt counting only in -current to
get a useful interrupt count in vmstat. Software interrupts are still
put in the global interrupt count (but not in the software interrupt
count) in RELENG_6. This makes them show up in vmstat output, and in
many configurations they dominate the global count so this count becomes
unrelated to the actual interrupt load. In -current they are counted
as software interrupts only. systat -vmstat reports interrupt counts
in finer detail so it is possible to determine various subcounts by
adding or subtracting the other counts.
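(A rough way to see the difference described above on a RELENG_6 box;
these are the same commands already used in this thread:)
%%%
vmstat -i          # per-device totals/rates; on RELENG_6 the global total
                   # also absorbs software interrupt requests
systat -vmstat 1   # finer breakdown; subtracting the per-device rows from
                   # the total gives a feel for the swi share
vmstat 5           # the "in" and "cs" columns being watched above
%%%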
...
- 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
doesn't increase!), other rates stay the same (looks like
possible display bugs in 'vmstat -i' here!)
Probably just averaging.
See, I'm not sure about that. I thought that the whole point of polling
was to avoid interrupts. Since the total count doesn't increase for
bge1 in 'vmstat -i' output, I interpreted it as a bug.
It's probably just the bogus software interrupt count. Apparently, polling
generates 20-30 software interrupts per poll. I don't know why it
generates so many, but the context switch count shows that most of them
don't generate a context switch, so most of them don't take much time.
Both software interrupts and hardware interrupts are currently counted
when they are requested, not when they are delivered. This is dubious
but works out OK for hardware interrupts only. For hardware interrupts,
even requests have a large overhead, so requests that will coalesce
should be counted somewhere; but for software interrupts, requests have
a low overhead, so the only reason to count requests that will coalesce
is to find and fix callers that make them. I think that for hardware
interrupts, requests that will coalesce are rare in practice since the
first request blocks subsequent ones.
I see the following packet loss for polling with HZ=1000, burst_max=300,
idle_poll=1:
%%%
input (bge0) output
packets errs bytes packets errs bytes colls
242999 1 14579940 0 0 0 0
235496 0 14129760 0 0 0 0
236930 3261 14215800 0 0 0 0
237816 3400 14268960 0 0 0 0
240418 3211 14425080 0 0 0 0
%%%
Well, I guess I'm doing OK, then. With the same settings as above:
[EMAIL PROTECTED] % netstat -I bge1 -w 5
input (bge1) output
packets errs bytes packets errs bytes colls
614710 0 513122698 0 0 0 0
662633 0 556662669 0 0 0 0
639052 0 530704135 0 0 0 0
706713 0 576938553 0 0 0 0
690495 0 554269218 0 0 0 0
682868 0 560234712 0 0 0 0
692268 0 562487939 0 0 0 0
680498 0 549782169 0 0 0 0
^C
Yes, I used -w 1 so my pps is about twice as much as yours, but I also use
tiny packets so as to get that high rate on low-end hardware, and that gives
a bandwidth that is about 1/8 of yours.
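(For reference, converting the samples above to rates: ~680,000 packets
per 5-second interval is about 136 kpps on bge1, versus about 240 kpps in
the bge0 sample taken with -w 1. A quick, hypothetical one-liner for
watching the rate directly:)
%%%
netstat -I bge1 -w 5 | awk '$1 ~ /^[0-9]+$/ { printf "%.0f kpps\n", $1/5/1000 }'
%%%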
Then again, it's after 1830 on a Friday afternoon, so traffic loads have
dropped a bit, so it's quite possible I'm not seeing anything dropped
here because of this relatively lighter load.
Problems are certainly more likely with higher pps. 140 kpps is quite
small. I can almost reach that with tiny packets on a 100Mbps network.
In spite of the momentary 0% loss, do you think switching to an em(4),
sk(4), or other card might help? The bge(4) interfaces are integrated
PCIe, and I think only PCI-X slots are available.
I believe em is (only slightly?) better but haven't used it. The bus
matters most unless the card is really stupid.
Bruce