On 05.08.2013 23:53, Luigi Rizzo wrote:
On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote:
On 05.08.2013 19:36, Luigi Rizzo wrote:
...

[picking a post at random to reply in this thread]
tell whether or not we should bail out).

Ideally we don't want to have any locks in the RX and TX path at all.

Ok, I have read your proposal below, but there are a few things
I do not completely understand -- essentially I do not understand
whether the removal of IFF_DRV_RUNNING (or equivalent) forces you
to replace ifp->new_tx_function() with something else when you
want to do an "ifconfig down".

Sorry, it's IFF_DRV_OACTIVE that'll go away, not IFF_DRV_RUNNING.
It's of no use with multi-queue anyways.  Though we may get more
differentiated states for interface availability.

Specifically, here are my doubts:

- Considering that the rxq lock is rarely contended
   (only when the control path is active, which in your approach
   below requires killing and restarting the ithread),
   and is acquired/released only once per interrupt/batch,
   I am under the impression that the per-queue RX lock is
   not a performance problem and makes it easier to reason
   about the code (and does not require a different approach
   for MSI-X and other cases).

There are two important things here: 1) there must only be one
(i)thread per RX queue to prevent lock contention; 2) even with
a non-contended lock there are effects felt by the other cores
or CPUs, like bus lock cycles, which are considerable.  Stopping
and restarting the ithread/taskqueue in those cases that require
more involved (hardware) reconfiguration isn't much of an overhead
compared to per-packet or per-packet-batch locking in the hot path.

- in the tx datapath, do you want to acquire the txq lock
   before or after calling ifp->new_tx_function() ?
   (btw how does it compare to ifp->if_start or ifp->if_transmit ?)

Struct ifnet is going to be split into two parts, the stack-owned
part and the driver-owned part.  The stack will never touch the
driver-owned part, and the driver must only change the stack-owned
part through accessor functions.  (This split has also been proposed
by Juniper quite some time ago, but for different reasons.)
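
Roughly, and with entirely made-up names and fields (this is only a sketch
of the idea, not the actual layout):

    #include <stdint.h>                 /* uint32_t */

    struct mbuf;                        /* opaque in this sketch */
    struct tx_queue;

    struct ifnet_stack {                /* stack-owned part */
            char      if_xname[16];
            uint32_t  if_flags;         /* stack's view of interface state */
            int     (*if_tx)(struct ifnet_stack *, struct mbuf *);
            void     *if_softc;         /* back-pointer to the driver-owned part */
    };

    struct ifnet_driver {               /* driver-owned part, opaque to the stack */
            struct tx_queue *txq;       /* per-queue locks, rings, ... */
            int              ntxq;
    };

    /* The driver never pokes at stack fields directly; it goes through
     * accessors like this one. */
    static inline void
    ifs_set_flags(struct ifnet_stack *ifs, uint32_t flags)
    {
            ifs->if_flags = flags;      /* a real accessor would synchronize */
    }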

The driver supplies a TX frame transmit function (mostly like if_transmit
today) which does all locking and multi-queue handling internally (driver
owned).  This gives driver writers the freedom to better adjust to different
hardware communication methods as they become available, as we have witnessed
a couple of times in the past.

If the driver is multi-queue capable it takes the RSS hash from the packet
header to select an appropriate queue, which may have its own dedicated lock.
If there is only one queue then it will obtain that lock, which may see some
contention when multiple cores try to send at the same time.  The driver may
do more extensive locking; however, that likely comes at the expense of
performance.

Typically on modern NICs the TX function will be a thin locked wrapper around
the DMA enqueue function.  For older or other kinds of interfaces it may
implement full internal soft queuing (as the IFQ did).  An efficient generic
implementation of those will be provided for drivers to make use of.
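
A rough sketch of such a driver transmit function; the structures and
my_encap() are hypothetical, while m_pkthdr.flowid and mutex(9) are the
existing kernel facilities (kernel headers assumed):

    struct tx_queue {
            struct mtx  txq_lock;       /* per-queue lock */
            /* DMA ring state, stats, soft queue, ... */
    };

    struct my_softc {
            struct tx_queue *txq;       /* array of TX queues */
            int              ntxq;
    };

    static int my_encap(struct tx_queue *, struct mbuf *);  /* ring enqueue */

    static int
    my_transmit(struct ifnet_stack *ifs, struct mbuf *m)
    {
            struct my_softc *sc = ifs->if_softc;
            struct tx_queue *txq;
            int error;

            /* Pick the queue from the flow/RSS hash so that all packets of
             * a session land on the same queue and stay ordered. */
            txq = &sc->txq[m->m_pkthdr.flowid % sc->ntxq];

            mtx_lock(&txq->txq_lock);
            error = my_encap(txq, m);   /* thin wrapper around the DMA enqueue */
            mtx_unlock(&txq->txq_lock);
            return (error);
    }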

>    If you do it within the tx handler then you lose the ability
>    of replacing the handler when you do a reconfiguration,
>    because grabbing the tx lock in the control path does not tell
>    you whether anyone is in the handler.
>    If you do it outside, then the driver should also expose the locks,
>    or some locking function.

The stack calls the transmit function without any driver-specific locks held.
It'll make sure, though, that the stack-ifnet doesn't go away in between,
probably by using cheap rmlocks.

The driver's control path gets a function to ensure that the stack has drained
all writers and that it is safe to reconfigure (as in callout_drain()).  Not
all hardware and control path changes necessarily require a reinitialization.
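
A minimal sketch of that drain handshake, assuming rmlock(9) is what ends up
being used (the stack_* names and the ifnet_stack sketch above are made up):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    static struct rmlock ifs_lock;      /* rm_init()ed at attach time,
                                         * protects the stack-owned ifnet */

    /* hot path: stack about to call the driver's transmit function */
    static int
    stack_transmit(struct ifnet_stack *ifs, struct mbuf *m)
    {
            struct rm_priotracker tracker;
            int error;

            rm_rlock(&ifs_lock, &tracker);  /* cheap, mostly uncontended */
            error = ifs->if_tx(ifs, m);     /* driver does its own queue locking */
            rm_runlock(&ifs_lock, &tracker);
            return (error);
    }

    /* control path: wait until all in-flight transmitters have left */
    static void
    stack_tx_drain(void)
    {
            rm_wlock(&ifs_lock);        /* returns only once all readers are out */
            rm_wunlock(&ifs_lock);      /* now it is safe to reconfigure */
    }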

The stack-ifnet shadows some information, like interface state, and may do its
own independent SMP optimizations to avoid contention.

Overall, it seems to me that keeping the IFF_DRV_RUNNING
flag is a lot more practical, not to mention that it requires
fewer modifications to the code.

Generally we want to optimize for the fast packet path, reduce any friction we
can, and take a hit on reconfig being more "expensive" as it is much less
frequent.

Mind you not all of this is worked out in detail yet and may be subject to
further changes and refinements as more benchmarking of different approaches
is performed.

--
Andre

[PS: I'm currently on summer vacation and write this while having access]

[description of Andre's proposal below, for reference]

        cheers
        luigi

RX with MSI-X interrupts:

Every queue has its own separate RX interrupt vector which is serviced by
a single dedicated ithread (or taskqueue) that loops (while(1)) over the work
on the RX ring, only going to sleep when it reaches the end of the ring's
work.  It is then woken up by an interrupt to resume work.  To prevent a
live-lock on the CPU it would periodically yield to allow other kernel and
user-space threads to run in between.  Each queue's ithread (taskqueue) can
be pinned to a particular core or float with a certain stickiness.  To
reconfigure the card (when the RX ring is affected) the ithread (taskqueue)
is terminated and, after the changes, restarted.  This is done by the control
path and allows us to run RX completely lock-free, because there is only ever
one ithread accessing the RX queue, doing away with the need for further
locking or synchronization.
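
In pseudo-C the body of such a per-queue RX thread would be roughly the
following (every helper and field here is hypothetical; yield policy and
error handling omitted):

    struct rx_queue {
            volatile int terminate;     /* set by the control path */
            int          budget_used;
            /* descriptor ring state, ... */
    };

    #define RX_BUDGET 256               /* arbitrary batch size for this sketch */

    /* hypothetical helpers */
    static int  rx_ring_has_work(struct rx_queue *);
    static void rx_process_one(struct rx_queue *);
    static void rx_yield(struct rx_queue *);
    static void rx_sleep_until_intr(struct rx_queue *);

    static void
    my_rxq_thread(void *arg)
    {
            struct rx_queue *rxq = arg;

            while (!rxq->terminate) {
                    while (rx_ring_has_work(rxq)) {
                            rx_process_one(rxq);
                            if (++rxq->budget_used >= RX_BUDGET) {
                                    rxq->budget_used = 0;
                                    rx_yield(rxq);      /* avoid live-lock */
                            }
                    }
                    rx_sleep_until_intr(rxq);   /* woken by the RX interrupt */
            }
            /* the control path asked us to exit; it restarts us after the
             * reconfiguration is done */
    }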

Ok, I admit I did not think of killing and restarting the ithread,
but I wonder how
RX with MSI or legacy interrupts:

Here multi-queue is impeded by some synchronization issues.  Depending on
the hardware synchronization model the ithreads again run the while(1) loop,
but have to be made runnable by the interrupt handler when new work has been
indicated.
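
The difference to the MSI-X case is only the hand-off from the shared
interrupt handler to the queue threads, roughly as below (again hypothetical
names, and assuming the softc also carries the RX queue array):

    static uint32_t my_read_and_ack_cause(struct my_softc *);  /* hypothetical */
    static void     rx_make_runnable(struct rx_queue *);       /* hypothetical */

    static void
    my_legacy_intr(void *arg)
    {
            struct my_softc *sc = arg;
            uint32_t cause;
            int i;

            cause = my_read_and_ack_cause(sc);      /* which queues have work? */
            for (i = 0; i < sc->nrxq; i++)
                    if (cause & (1u << i))
                            rx_make_runnable(&sc->rxq[i]);
    }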

TX in general:

This is a bit more tricky, as we have the hard requirement of packet ordering
for individual sessions (TCP and others).  That means we must use the same
queue for all packets of the same session.  This makes running TX lock-free
tricky, because when we pin a TX queue to a core we must switch to that core
first before adding anything to the queue.  If the packet was generated on
that core there is no overhead; OTOH if not, there is the scheduler overhead
and some cache losses.  Ensuring that the entire TX path, possibly including
user-space (write et al.), happens on the same core is difficult and may have
its own inefficiencies.  The number of TX queues vs. the number of cores is
another point of contention.  To make the 1:1 scheme work well, one must have
as many queues as cores.

A more scalable and generic approach doesn't bind TX queues to any particular
core and covers each with its own lock to protect the TX ring enqueue.  To
prevent false cache line sharing between the locks, each TX queue structure
should be on its own cache line.  As each session has an affinity hash, the
sessions become distributed across the TX queues, significantly reducing
contention.
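
In code this is little more than the hypothetical tx_queue structure from the
transmit sketch above gaining an alignment attribute, plus the hash-to-queue
mapping:

    /* Sketch: keep every TX queue's state on its own cache line(s). */
    struct tx_queue {
            struct mtx  txq_lock;
            /* ring pointers, producer index, stats, ... */
    } __aligned(CACHE_LINE_SIZE);       /* no false sharing between queues */

    /* affinity-hash -> queue mapping, as used in the transmit sketch above */
    #define TXQ_FOR_HASH(sc, hash)  (&(sc)->txq[(hash) % (sc)->ntxq])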

The TX completion handling, which frees the mbuf(s), is done the same way as
for the RX ring, with a dedicated ithread (taskqueue) for each queue.  Also,
bpf should hook in here (when the packet has actually been sent) instead of
at TX enqueue time.
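
Sketched out, the completion side mirrors the RX worker; the ring-walking
helper is hypothetical, while ETHER_BPF_MTAP() and m_freem() are the existing
kernel facilities (ifp stands for whatever handle bpf ends up needing):

    static struct mbuf *tx_ring_next_completed(struct tx_queue *); /* hypothetical */

    static void
    my_txq_complete(struct ifnet *ifp, struct tx_queue *txq)
    {
            struct mbuf *m;

            while ((m = tx_ring_next_completed(txq)) != NULL) {
                    ETHER_BPF_MTAP(ifp, m);     /* tap here: the packet has
                                                 * actually been sent */
                    m_freem(m);
            }
    }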

The whole concept of IFF_DRV_RUNNING will go away along with the IFQ macros.
Each driver then provides a direct TX function pointer which is put into a
decoupled ifnet TX field for use by the stack.  Under normal operation all
interface TX goes directly into the TX DMA ring.  The particular multi-queue
and locking model is decided by the driver.  The kernel provides a couple
of common optimized abstractions for use by all drivers that want/need them
to avoid code and functionality duplication.  When things like ALTQ are
activated on an interface the ifnet TX function pointer is replaced with
the appropriate intermediate function pointer, which eventually will call
the driver's TX function.  The intermediate TX packet handler (ALTQ) can
do its own additional locking on top of the driver's TX locking.
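
Pointer-wise that interposition is plain indirection, for example (using the
ifnet_stack sketch from above plus a hypothetical saved-pointer field; the
swap itself would of course happen under the drain handshake shown earlier):

    /* Sketch: interposing an intermediate TX handler, ALTQ-style. */
    static int
    altq_tx_interpose(struct ifnet_stack *ifs, struct mbuf *m)
    {
            /* own queueing/shaping and locking here, then hand down */
            return (ifs->if_tx_saved(ifs, m));
    }

    static void
    altq_attach(struct ifnet_stack *ifs)
    {
            ifs->if_tx_saved = ifs->if_tx;  /* remember the driver's function */
            ifs->if_tx       = altq_tx_interpose;
    }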

Control path:

The control path is separate from the high-speed TX and RX data paths.
It has its own overall lock.  Multiple locks typically don't make sense
here.  If it needs to modify, reconfigure, or newly set up the TX or RX
rings then it has to stop the RX ithreads (taskqueues) and take the TX
queue locks before it can do that.  After the changes are made and are
stable, packet processing can continue.
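
Put together, a ring reconfiguration in the control path is roughly the
following sequence (all helpers and the ctrl_lock field are hypothetical,
building on the sketches above):

    static void stop_rx_threads(struct my_softc *);     /* terminate + wait */
    static void start_rx_threads(struct my_softc *);
    static void reprogram_rings(struct my_softc *);     /* the actual change */

    static void
    my_reinit_rings(struct my_softc *sc)
    {
            int i;

            mtx_lock(&sc->ctrl_lock);           /* one overall control lock */

            stop_rx_threads(sc);                /* RX side is now quiescent */
            for (i = 0; i < sc->ntxq; i++)
                    mtx_lock(&sc->txq[i].txq_lock);   /* freeze TX enqueues */

            reprogram_rings(sc);

            for (i = 0; i < sc->ntxq; i++)
                    mtx_unlock(&sc->txq[i].txq_lock);
            start_rx_threads(sc);

            mtx_unlock(&sc->ctrl_lock);
    }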

I've adjusted / heavily modified the fxp, bge and igb drivers in my
tcp_workqueue branch to test these concepts.  Not all of it is committed
or fully up to date.

--
Andre



