Roch - PAE wrote:
Garrett D'Amore writes:
> Roch - PAE wrote:
> > jason jiang writes:
> > > From my experience, using a softintr to distribute the packets to the
> > > upper layer gives worse latency and throughput than handling them in a
> > > single interrupt thread. And you want to make sure you do not handle
> > > too many packets in one interrupt.
> > >
> > >
> >
> >
> > I see that both interrupt schemes suffer from the same
> > drawback of pinning whatever thread happens to be running on
> > the interrupt/softintr CPU. The problem gets really annoying
> > when the incoming inter-packet interval is smaller than the
> > handling time under the interrupt. Even if the code is set
> > to return after handling N packets, a new interrupt will be
> > _immediately_ signaled and the pinning will keep on going.
> >
>
> That depends on the device and the driver. It's fully possible to
> acknowledge multiple interrupts at once. With most hardware I've
> worked with, if multiple packets arrive while the interrupt on the
> device has not been acknowledged, multiple interrupts are not
> received. So if you don't acknowledge the interrupt until you think
> you're done processing, you probably won't take another interrupt
> when you exit. (You do need to check the ring one last time after
> acknowledging the interrupt, though, to prevent a lost-packet race.)
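In code, that ordering is roughly this (just a sketch; the hooks and
names below are made up for illustration, not any real driver or DDI
interface):

/*
 * Process everything pending, ack the interrupt, then look once more.
 * The rx_dev hooks stand in for whatever register access a real
 * driver would do.
 */
#include <stddef.h>

struct pkt;                                     /* opaque packet */

struct rx_dev {
    struct pkt *(*ring_next)(struct rx_dev *);  /* NULL when ring empty */
    void        (*ack_intr)(struct rx_dev *);   /* clear the RX interrupt */
    void        (*deliver)(struct pkt *);       /* hand packet up the stack */
};

static unsigned
rx_intr(struct rx_dev *dev)
{
    unsigned handled = 0;
    struct pkt *p;

    /* Drain everything that has arrived so far, without acking. */
    while ((p = dev->ring_next(dev)) != NULL) {
        dev->deliver(p);
        handled++;
    }

    /* Acknowledge only now that we think we're done. */
    dev->ack_intr(dev);

    /*
     * Check one last time: a packet that landed between the final
     * ring_next() above and the ack would otherwise sit in the ring
     * with no new interrupt to announce it.
     */
    while ((p = dev->ring_next(dev)) != NULL) {
        dev->deliver(p);
        handled++;
    }

    return (handled);
}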
I agree about the interrupts being coalesced. The problem
that needs to be dealt with is when there is an endless
stream of inbound data with an inter-packet gap that is smaller
than the handling time (that ill-defined entity).
In this case, the driver/stack simply cannot keep up with the inbound
packets. This is the problem that solutions like 802.3x flow control
and RED (random early drop) are supposed to address. Otherwise you wind
up just losing packets.
Hopefully, you don't get so far behind that the system winds up
processing packets which are later discarded as being "stale". If that
happens then you have a serious problem. Modern CPUs with modern
devices/drivers shouldn't have that problem.
> > Now, the per-packet handling time is not a well-defined
> > entity. The software stack can choose to do more work (say, push up
> > through TCP/IP) or less (just queue and wake a kernel
> > thread) on each packet. All this needs to be managed based
> > on the load, and we're moving in that direction.
> >
>
> There are other changes in progress... when the stack can't keep up
> with the inbound packets at _interrupt_ rate, it will have the
> ability to turn off interrupts on the device (if the device supports
> it) and run the receive thread in "polling mode". This means that
> you have no inter-packet context switches. It will stay in this mode
> until the poller empties the receive ring.
>
Perfect.
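For concreteness, the handover looks roughly like this (again only a
sketch; the hooks are invented, and the real crossbow/GLDv3 polling
interfaces will look different):

#include <stdbool.h>
#include <stddef.h>

struct pkt;

struct rx_ring {
    struct pkt *(*next)(struct rx_ring *);          /* NULL when empty */
    void        (*disable_intr)(struct rx_ring *);
    void        (*enable_intr)(struct rx_ring *);
    void        (*deliver)(struct pkt *);
    void        (*wake_poller)(struct rx_ring *);   /* kick the poll thread */
    bool         polling;                           /* locking elided */
};

/* Interrupt handler: hand the ring over to the poller and go quiet. */
static void
rx_intr(struct rx_ring *r)
{
    if (!r->polling) {
        r->disable_intr(r);         /* no more per-packet interrupts */
        r->polling = true;
        r->wake_poller(r);
    }
}

/* Called repeatedly from the stack's receive thread while polling. */
static void
rx_poll(struct rx_ring *r, unsigned budget)
{
    struct pkt *p = NULL;

    while (budget > 0 && (p = r->next(r)) != NULL) {
        r->deliver(p);
        budget--;
    }

    /*
     * Ring drained before the budget ran out: fall back to interrupt
     * mode.  (A real driver re-checks the ring after re-enabling the
     * interrupt, for the same lost-packet race as before.)
     */
    if (p == NULL) {
        r->polling = false;
        r->enable_intr(r);
    }
}

The poller keeps calling rx_poll() until the ring goes empty, so the
per-packet interrupt and context-switch costs disappear while the
burst lasts.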
> > At the driver level, if you reach a point where you have a
> > large queue in the HW receive rings, that is a nice
> > indication that deferring the processing to a non-interrupt
> > kernel thread would be good. Under this condition the thread
> > wakeup cost is amortized over the handling of many packets.
> >
>
> Hmm... but you still have the initial latency for the first packet in
> the ring. It's not fatal, but it's not nice to add 10 msec of latency
> if you don't have to, either.
Absolutely. The first packets are handled quickly, as soon as
they arrive. The handling of those initial packets is exactly
what can cause a backlog to build up in the HW. When the HW
has a backlog built up, there is no sense in continuing to process
packets under the interrupt; latency is not the critical metric at
that point.
Why not deal with them under the interrupt? Assuming you can process
them all in the same interrupt context, there is no real reason not to.
You do not want to have to service multiple interrupts, but once you're
already in interrupt context, if the packets keep coming, you might as
well deal with them there, as long as you can do so quickly, without
tying up the CPU and keeping it from other system-critical tasks. (In
multi-CPU systems, the idea of dedicating a processor to handling rx
interrupts from a high-traffic NIC is actually very reasonable.)
To a certain extent, the question of the "context" that you are handling
the traffic in is one of resource allocation... interrupt context is a
bad place for a shared CPU to be, at least for very long. But if you
have the CPU to dedicate to the job, and the traffic to justify it,
leaving the CPU running in interrupt context works pretty well.
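To put the latency-vs-backlog trade-off in rough code form (the names
and thresholds here are made up; this is not what the stack actually
does, just the shape of the heuristic being discussed):

#include <stddef.h>

struct pkt;

struct rx_ring {
    struct pkt *(*next)(struct rx_ring *);            /* NULL when empty */
    unsigned    (*backlog)(struct rx_ring *);         /* descriptors pending */
    void        (*deliver)(struct pkt *);
    void        (*wake_rx_thread)(struct rx_ring *);  /* cv_signal(), etc. */
};

enum {
    RX_INTR_BUDGET     = 32,   /* latency path: handle these inline */
    RX_DEFER_THRESHOLD = 64    /* beyond this, hand off to the thread */
};

static void
rx_intr(struct rx_ring *r)
{
    unsigned handled = 0;
    struct pkt *p;

    /* The first packets are handled right away, in interrupt context. */
    while (handled < RX_INTR_BUDGET && (p = r->next(r)) != NULL) {
        r->deliver(p);
        handled++;
    }

    /*
     * If a real backlog has built up anyway, stop burning interrupt
     * time on it; latency is no longer the critical metric, and the
     * thread wakeup cost is amortized over everything in the ring.
     */
    if (r->backlog(r) >= RX_DEFER_THRESHOLD)
        r->wake_rx_thread(r);
}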
In fact, lately I've been doing a lot of performance testing with IP
forwarding of very small (64-byte) packets. I've found that the best
way to get good performance on systems with multiple CPUs is to allocate
a CPU to the task of interrupt handling for each high-traffic NIC.
Right now this requires some finagling with psrset and psradm -i, and
looking at bindings with mdb "::interrupts", but it is really
worthwhile. The performance boost you get by doing this kind of
tweaking can be nearly 100%.
(E.g. I've been able to process ~500,000 inbound packets per second on a
single 2.4GHz core using this technique. I'm looking at ways to
increase this number even further.)
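(For perspective, 500,000 packets per second on a 2.4GHz core works
out to a budget of roughly 2,400,000,000 / 500,000 = 4,800 cycles per
packet, and that has to cover the interrupt, the driver, and the
forwarding path.)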
> The details of this decision are moving up-stack
> though, in the form of squeues and polling with crossbow.
Right, I just suggest that the HW backlog might be one of the variables
involved in the decision.
If you have a HW backlog, you probably also have an upstream backlog.
But yes, it should be taken into account.
Again, note that a lot of NICs on the market these days have support for
features like 802.3x, which is intended to help manage the backlog at
the link layer (in this case by providing flow control information to
the peer systems in the network).
-- Garrett
-r
>
> We are looking at other ways to reduce per-packet processing overhead as
> well... stay tuned.
>
> -- Garrett