http://www.freebsd.org/projects/netperf/
Project Goal
The netperf project is working to enhance the performance of the FreeBSD network stack. This work grew out of the SMPng Project, which moved the FreeBSD kernel from a "Giant lock" to more fine-grained locking and multi-threading. SMPng brought the network stack both performance improvements and degradations: it improved parallelism and preemption, but substantially increased per-packet processing costs. The netperf project is primarily focused on further improving parallelism in network processing while reducing SMP synchronization overhead, which in turn will lead to higher processing throughput and lower processing latency.
Project Strategies
Robert Watson
This work has two primary focuses: increasing parallelism and decreasing overhead. Several activities are being performed that work toward these goals:
- Complete locking work to make sure all components of the stack are able to run without the Giant lock. While most of the network stack, especially mainstream protocols, runs without Giant, some components still require Giant to be placed back over the stack if they are compiled into the kernel, reducing parallelism.
- Optimize locking strategies to find better balances between locking granularity and locking overhead. In the first cut at locking for the kernel, the goal was to adopt a medium-grained locking approach based on data locking. This approach identifies critical data structures and inserts new locks and locking operations to protect those data structures. Depending on the data model of the code being protected, this may lead to the introduction of a substantial number of locks offering unnecessary granularity, where the overhead of locking overwhelms the benefits of available parallelism and preemption. By selectively reducing granularity, it is possible to improve performance by decreasing locking overhead.
- Amortize the cost of locking by processing queues of packets or events. While the cost of individual synchronization operations may be high, it is possible to amortize that cost by grouping the processing of similar data (packets, events) under the same protection. This approach focuses on identifying places where similar locking occurs frequently in succession, and introducing queueing or coalescing of lock operations across the body of the work. For example, when a series of packets is inserted into an outgoing interface queue, a basic locking approach would lock the queue for each insert operation, unlock it, and hand off to the interface driver to begin the send, repeating this sequence as required. With a coalesced approach, the caller would pass off a queue of packets in order to reduce the locking overhead, as well as eliminate unnecessary synchronization, since the queue being built is thread-local. This approach can be applied at several levels in the stack, and is particularly applicable at lower levels of the stack where streams of packets require almost identical processing. (A sketch of this pattern appears after this list.)
- Introduce new synchronization strategies with reduced overhead relative to traditional strategies. Most traditional strategies employ a combination of interrupt disabling and atomic operations to achieve mutual exclusion and non-preemption guarantees. However, these operations are expensive on modern CPUs, leading to the desire for cheaper primitives with weaker semantics. Examples include applying uni-processor primitives where synchronization is required only on a single processor, and optimizing critical section primitives to avoid the need for interrupt disabling.
- Modify synchronization strategies to take advantage of additional, non-locking, synchronization primitives. This approach might take the form of making increased use of per-CPU or per-thread data structures, which require little or no synchronization. For example, through the use of critical sections, it is possible to synchronize access to per-CPU caches and queues. Through the use of per-thread queues, data can be handed off between stack layers without the use of synchronization.
- Increase the opportunities for parallelism through increased threading in the network stack. The current network stack model offers the opportunity for substantial parallelism, with outbound processing typically taking place in the context of the sending thread in the kernel, crypto occurring in crypto worker threads, and receive processing taking place in a combination of the receiving ithread and the dispatched netisr thread. While handoffs between threads introduce overhead (synchronization, context switching), there is the opportunity to increase parallelism in some workloads by introducing additional worker threads. Identifying work that may be relocated to new threads must be done carefully to balance overhead and latency concerns, but can pay off by increasing effective CPU utilization and hence throughput. For example, introducing additional netisr threads capable of running on more than one CPU at a time can increase input parallelism, subject to maintaining desirable packet ordering.
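
As a concrete illustration of the amortization strategy, here is a minimal userspace sketch of the batched-handoff pattern, using POSIX threads in place of kernel mutexes; the structures and function names are hypothetical rather than actual FreeBSD stack code. The naive path pays one lock/unlock round trip per packet, while the batched path links packets into a thread-local chain without locking and hands the whole chain off under a single lock acquisition.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical packet and shared-queue types, for illustration only. */
    struct packet {
            struct packet *next;
            /* ... payload ... */
    };

    struct packet_queue {
            struct packet *head;            /* first packet, NULL if empty */
            struct packet *tail;            /* last packet, for O(1) append */
            pthread_mutex_t lock;           /* protects head and tail */
    };

    /* Naive approach: one lock/unlock round trip for every packet. */
    void
    enqueue_one(struct packet_queue *q, struct packet *p)
    {
            pthread_mutex_lock(&q->lock);
            p->next = NULL;
            if (q->tail != NULL)
                    q->tail->next = p;
            else
                    q->head = p;
            q->tail = p;
            pthread_mutex_unlock(&q->lock);
    }

    /*
     * Amortized approach: the caller links packets into a thread-local
     * chain (no locking required, since no other thread can see it) and
     * hands the whole chain off with a single lock/unlock pair.  The
     * caller guarantees that tail is the last element and tail->next
     * is NULL.
     */
    void
    enqueue_batch(struct packet_queue *q, struct packet *head,
        struct packet *tail)
    {
            pthread_mutex_lock(&q->lock);
            if (q->tail != NULL)
                    q->tail->next = head;
            else
                    q->head = head;
            q->tail = tail;
            pthread_mutex_unlock(&q->lock);
    }
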
Project Tasks
Each task below lists the responsible developer, the date of its last update, and its current status.

Task: Prefer file descriptor reference counts to socket reference counts for system calls
Responsible: Robert Watson
Last updated: 20041124
Status: Done
Notes: Sockets and file descriptors both have reference counts in order to prevent these objects from being freed while in use. However, if a file descriptor is used to reach the socket, the reference counts are somewhat interchangeable, as either will prevent undesired garbage collection. For socket system calls, overhead can be reduced by relying on the file descriptor reference count, thus avoiding the synchronized operations necessary to modify the socket reference count, an approach also taken in the VFS code. This change has been made for most socket system calls, and has been committed to HEAD (6.x). It has also been merged to RELENG_5 for inclusion in 5.4.

Task: Mbuf queue library
Responsible: Robert Watson
Last updated: 20041124
Status: Prototyped
Notes: In order to facilitate passing queues of packets between network stack components, create an mbuf queue primitive, struct mbufqueue. The initial implementation is complete, and the primitive is now being applied in several sample cases to determine whether it offers the desired semantics and benefits. The implementation can be found in the rwatson_dispatch Perforce branch. Additional work must also be done to explore the performance impact of "queues" vs. arrays of mbuf pointers, which are likely to behave better from a caching perspective. (A hypothetical sketch of such a primitive follows.)
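
As a rough guess at what such a primitive might look like (the actual struct mbufqueue lives in the rwatson_dispatch Perforce branch and may well differ), here is a self-contained sketch with a minimal stand-in mbuf. No locking appears, because the queue is intended to be thread-local during handoff:

    #include <stddef.h>

    /* Minimal stand-in for the kernel's struct mbuf, for illustration. */
    struct mbuf {
            struct mbuf *m_nextpkt;         /* next packet in the queue */
            /* ... packet headers and data ... */
    };

    /* Hypothetical queue primitive: head/tail for O(1) append, plus length. */
    struct mbufqueue {
            struct mbuf *mq_head;           /* first packet, NULL if empty */
            struct mbuf *mq_tail;           /* last packet */
            unsigned int mq_len;            /* number of packets queued */
    };

    static inline void
    mbufqueue_init(struct mbufqueue *mq)
    {
            mq->mq_head = mq->mq_tail = NULL;
            mq->mq_len = 0;
    }

    static inline void
    mbufqueue_enqueue(struct mbufqueue *mq, struct mbuf *m)
    {
            m->m_nextpkt = NULL;
            if (mq->mq_tail != NULL)
                    mq->mq_tail->m_nextpkt = m;
            else
                    mq->mq_head = m;
            mq->mq_tail = m;
            mq->mq_len++;
    }

    static inline struct mbuf *
    mbufqueue_dequeue(struct mbufqueue *mq)
    {
            struct mbuf *m = mq->mq_head;

            if (m != NULL) {
                    mq->mq_head = m->m_nextpkt;
                    if (mq->mq_head == NULL)
                            mq->mq_tail = NULL;
                    m->m_nextpkt = NULL;
                    mq->mq_len--;
            }
            return (m);
    }
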

Task: Employ queued dispatch in interface send API
Responsible: Robert Watson
Last updated: 20041106
Status: Prototyped
Notes: An experimental if_start_mbufqueue() interface to struct ifnet has been added, which passes an mbuf queue to the device driver for processing, avoiding redundant synchronization against the interface queue even when additional queueing is required. This has not yet been benchmarked. A subset change that dispatches a single mbuf to a driver has also been prototyped, and benchmarked at a several-percentage-point improvement in packet send rates from user space.

Task: Employ queued dispatch in the interface receive API
Responsible: Robert Watson
Last updated: 20041106
Status: New task
Notes: Similar to if_start_mbufqueue(), allow input of a queue of mbufs from the device driver into the lowest protocol layers, such as an ether_input_mbufqueue().

Task: Employ queued dispatch across netisr dispatch API
Responsible: Robert Watson
Last updated: 20041124
Status: Prototyped
Notes: Pull all of the mbufs in the netisr ifqueue out of the ifqueue into a thread-local mbuf queue, to avoid repeated lock operations to access the queue. Also use lock-free operations to test whether the queue has contents. This has been prototyped in the rwatson_netperf branch. (A sketch of the pattern follows.)
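
A rough userspace illustration of this pattern, building on the hypothetical struct mbufqueue sketch above (the function and variable names here are invented, not the prototype's): one lock round trip moves the entire shared queue into thread-local storage, and an unlocked emptiness test skips the lock entirely when there is no work, mirroring the lock-free content check described in the notes.

    #include <pthread.h>
    #include <stddef.h>

    /* Shared inbound queue and its lock (initialization not shown). */
    extern pthread_mutex_t netisr_queue_lock;
    extern struct mbufqueue netisr_queue;

    static void
    netisr_drain(void (*process)(struct mbuf *))
    {
            struct mbufqueue local;
            struct mbuf *m;

            /*
             * Unlocked peek at the queue head: a stale NULL simply means
             * the work is picked up on the next pass, so no lock is
             * needed just to discover that the queue is empty.
             */
            if (netisr_queue.mq_head == NULL)
                    return;

            /* One lock round trip claims every queued packet at once. */
            pthread_mutex_lock(&netisr_queue_lock);
            local = netisr_queue;           /* struct copy: head, tail, len */
            mbufqueue_init(&netisr_queue);
            pthread_mutex_unlock(&netisr_queue_lock);

            /* Process the claimed packets with no further locking. */
            while ((m = mbufqueue_dequeue(&local)) != NULL)
                    process(m);
    }
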

Task: Modify UMA allocator to use critical sections rather than mutexes for per-CPU caches
Responsible: Robert Watson
Last updated: 20050429
Status: Done
Notes: The mutexes protecting per-CPU caches require atomic operations on SMP systems; as the caches are per-CPU objects, the cost of synchronizing access to them can be reduced by using CPU pinning and/or critical sections instead. This change has now been committed and will appear in 6.0-RELEASE; it results in a several-percentage-point performance improvement in UDP send from user space, and there have been reports of 20%+ improvements in allocation-intensive code within the kernel. In micro-benchmarks, the cost of allocation on SMP is dramatically reduced. (A sketch of the idea follows.)
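
The core idea can be sketched as follows. This is a kernel-context illustration, not the actual uma(9) change: the cache structure and allocation function are hypothetical, while critical_enter(), critical_exit(), PCPU_GET(), and MAXCPU are real kernel facilities. Because each cache belongs to exactly one CPU, and a critical section prevents both preemption and migration off that CPU, no atomic operations or mutexes are needed to serialize access:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>

    /* Hypothetical per-CPU free-item cache, one slot per CPU. */
    struct percpu_cache {
            void    *items[64];             /* cached free items */
            int      count;                 /* number of valid entries */
    };

    static struct percpu_cache pc_caches[MAXCPU];

    static void *
    percpu_cache_alloc(void)
    {
            struct percpu_cache *cache;
            void *item = NULL;

            critical_enter();       /* no preemption, no CPU migration */
            cache = &pc_caches[PCPU_GET(cpuid)];
            if (cache->count > 0)
                    item = cache->items[--cache->count];
            critical_exit();
            return (item);          /* NULL means fall back to a slow path */
    }
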

Task: Modify malloc(9) allocator to use per-CPU statistics with critical sections to protect malloc_type statistics rather than global statistics with a mutex
Responsible: Robert Watson
Last updated: 20050529
Status: Done
Notes: Previously, malloc(9) used a single statistics structure protected by a mutex to hold global malloc statistics for each malloc type. This change moves to per-CPU statistics structures, which are coalesced when reporting memory allocation statistics to the user, and protects them using critical sections. This reduces cache line contention for common allocation types by avoiding shared lines, and also reduces synchronization costs by using critical sections to synchronize access instead of a mutex. While malloc(9) is less frequently used in the network stack than uma(9), it is used for socket address data, so it is on performance-critical paths for datagram operations. This has been committed and will appear in 6.0-RELEASE.

Task: Optimize critical section performance
Responsible: John Baldwin
Last updated: 20050404
Status: Done
Notes: Critical sections prevent preemption of a thread on a CPU, as well as preventing migration of that thread to another CPU, and may be used for synchronizing access to per-CPU data structures, as well as preventing recursion in interrupt processing. Currently, critical sections disable interrupts on the CPU. In previous versions of FreeBSD (4.x and before), optimizations were present that allowed for software interrupt disabling, which lowers the cost of critical sections in the common case by avoiding expensive microcode operations on the CPU. By restoring this model, or a variation on it, critical sections can be made substantially cheaper to enter. In particular, this change lowers the cost of critical sections on UP to approximately that of a mutex, meaning that optimizations on SMP to use critical sections instead of mutexes will not harm UP performance. This change has now been committed, and will appear in 6.0-RELEASE. (A simplified sketch of the software-deferral idea follows.)
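
A simplified, hypothetical sketch of the software-deferral idea (the real implementation lives in the scheduler and interrupt code, and the field names here are invented): entering a critical section becomes a per-thread counter increment rather than an interrupt-masking instruction, and any preemption that arrives in the meantime is recorded and acted upon only when the outermost section exits.

    /* Illustrative per-thread state; not the kernel's struct thread. */
    struct thread_state {
            int     critnest;       /* critical section nesting depth */
            int     owepreempt;     /* preemption deferred while inside */
    };

    static void
    crit_enter(struct thread_state *td)
    {
            td->critnest++;         /* cheap: no interrupt masking */
    }

    static void
    crit_exit(struct thread_state *td)
    {
            if (--td->critnest == 0 && td->owepreempt) {
                    td->owepreempt = 0;
                    /* ... perform the deferred preemption here ... */
            }
    }
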

Task: Normalize socket and protocol control block reference model
Responsible: Robert Watson
Last updated: 20060401
Status: Done
Notes: The socket/protocol boundary is characterized by a set of data structures and API interfaces, where the socket code acts as both a consumer and a service library for protocols. This task is to normalize the reference model by which protocol state is attached to and detached from socket state in order to strengthen invariants, allowing the removal of countless unused code paths (especially error handling), the removal of unnecessary locking in TCP, and a general improvement in the structure of the code. This serves both the immediate purpose of improving the quality and performance of this code, and is necessary for future optimization work. These changes have been prototyped in Perforce and are now merged to 7-CURRENT. They will be merged into RELENG_6 once they have been thoroughly tested.

Task: Add true inpcb reference count support
Responsible: Mohan Srinivasan, Robert Watson, Peter Wemm
Last updated: 20060412
Status: New task
Notes: Currently, the in-bound TCP and UDP socket paths rely on the global pcbinfo locks to prevent the PCBs to which packets are being delivered from being garbage collected by another thread while in use. This set of changes introduces a true reference model for PCBs so that the global lock can be released during in-bound processing.

Task: Fine-grained locking for UNIX domain sockets
Responsible: Robert Watson
Last updated: 20060416
Status: Prototyped
Notes: Currently, UNIX domain sockets in FreeBSD 5.x and 6.x use a single global subsystem lock. This is sufficient to allow the code to run without Giant, but results in contention when large numbers of processors simultaneously operate on UNIX domain sockets. This task introduces per-protocol-control-block locks in order to reduce contention on the larger subsystem lock. (A sketch of the locking split follows.)
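
A userspace sketch of the locking split being introduced (all names here are hypothetical): a global lock protects only the list of protocol control blocks, while each pcb carries its own lock for its fields, so operations on unrelated sockets no longer contend. The pcb lock is acquired before the global lock is dropped, so the pcb cannot be freed out from under the caller.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical UNIX domain socket pcb with its own fine-grained lock. */
    struct unp_pcb {
            struct unp_pcb  *next;          /* global pcb list linkage */
            pthread_mutex_t  pcb_lock;      /* protects this pcb's fields */
            /* ... per-socket state ... */
    };

    static pthread_mutex_t unp_list_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct unp_pcb *unp_list;

    /*
     * Find a pcb and return it locked.  The global lock is held only for
     * the duration of the list walk, not for the whole socket operation.
     */
    static struct unp_pcb *
    unp_lookup_and_lock(int (*match)(struct unp_pcb *))
    {
            struct unp_pcb *unp;

            pthread_mutex_lock(&unp_list_lock);
            for (unp = unp_list; unp != NULL; unp = unp->next) {
                    if (match(unp)) {
                            pthread_mutex_lock(&unp->pcb_lock);
                            break;
                    }
            }
            pthread_mutex_unlock(&unp_list_lock);
            return (unp);           /* caller unlocks pcb_lock when done */
    }
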
Netperf Cluster
Through the generous donations and investment of Sentex Data Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD Foundation, a network performance testbed has been created in Ontario, Canada for use by FreeBSD developers working in the area of network performance. A similar cluster, made possible through the generous donation of Verio, is being prepared for use in more general SMP performance work in Virginia, US. Each cluster consists of several SMP systems interconnected with gigabit Ethernet such that relatively arbitrary topologies can be constructed in order to test host-to-host, IP forwarding, and bridging performance scenarios. Systems are network booted and have serial consoles and remote power, in order to maximize availability and minimize configuration overhead. These systems are available on a check-out basis for experimentation and performance measurement to FreeBSD developers working on the netperf project and in related areas.
More detailed information on the netperf cluster can be found by following this link.
Papers and Reports
The following paper(s) have been produced by or are related to the Netperf Project:
Additional papers can be found on the SMPng Project web page.
Links
Some useful links relating to the netperf work: