Adaptive IRQ moderation (also called adaptive IRQ coalescing) has been widely
used in the networking stack for over 20 years and has become a standard
default setting.

Adaptive moderation is a feature supported by the device to delay an interrupt
for either a period of time or a number of completions (packets for networking
devices) in order to optimize the effective throughput by mitigating the cost
of handling interrupts, which can be expensive at the rates that modern
networking and/or storage devices sustain.

The basic concept of adaptive moderation is to provide time and packet based
control of when the device generates an interrupt in an adaptive fashion,
attempting to identify the current workload based on online gathered
statistics instead of a predefined configuration. When done correctly,
adaptive moderation can win the best of both throughput and latency.

This RFC patchset introduces a generic library that provides the mechanics
for online statistics monitoring and decision making. The consumer simply
needs to program the actual device, while room is still kept for device
specific tuning.

In the networking stack, each device driver implements adaptive IRQ moderation
on its own. The approach here is a bit different: it tries to take the common
denominator, which is per-queue statistics gathering and workload change
identification (basically deciding if the moderation scheme needs to change).

The library is targeted at multi-queue devices, but should work on single
queue devices as well. However, I'm not sure that these devices will need
something like interrupt moderation.

The model used in the proposed implementation requires the consumer (a.k.a.
the device driver) to initialize an irq adaptive moderator context (a.k.a.
irq_am) per desired context, which will most likely be a completion queue.

The moderator is initialized with a set of am (adaptive moderation) levels,
which are essentially an abstraction of the device specific moderation
parameters. Usually the different levels will map to different pairs of
(time <usecs>, completion-count), which are left specific to the consumer.

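To make the level abstraction concrete, here is a minimal userspace sketch,
not the code from this series. The names am_level, example_levels, fake_cq
and fake_cq_program are hypothetical, and the (usecs, completion-count)
values are made up for illustration. A real consumer would pick values that
suit its device and program its hardware instead of a fake struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical level descriptor: one (time, completion-count) pair. */
struct am_level {
	unsigned short usecs;	/* max delay before an interrupt is raised */
	unsigned short comps;	/* max completions coalesced per interrupt */
};

/* Illustrative values only; real tables are device specific. */
static const struct am_level example_levels[] = {
	{ .usecs = 1,  .comps = 1 },
	{ .usecs = 8,  .comps = 8 },
	{ .usecs = 16, .comps = 16 },
	{ .usecs = 32, .comps = 32 },
	{ .usecs = 64, .comps = 64 },
};

#define EXAMPLE_NR_LEVELS \
	(sizeof(example_levels) / sizeof(example_levels[0]))

/* Stand-in for a completion queue whose moderation can be programmed. */
struct fake_cq {
	unsigned short cur_usecs;
	unsigned short cur_comps;
};

/*
 * Stand-in for the consumer's "program the device" step: translate a
 * level index into the device specific pair. Returns 0 on success and
 * -1 if the level is out of range (the queue is left untouched).
 */
static int fake_cq_program(struct fake_cq *cq, size_t level)
{
	assert(cq != NULL);
	if (level >= EXAMPLE_NR_LEVELS)
		return -1;
	cq->cur_usecs = example_levels[level].usecs;
	cq->cur_comps = example_levels[level].comps;
	return 0;
}
```

The point of the indirection is that the moderator only ever deals with a
level index, while the meaning of each level stays with the consumer.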
The moderator assumes that the am levels are sorted in increasing order, where
the lowest level corresponds to the optimum latency tuning (short time and low
completion-count), gradually increasing towards the throughput optimum tuning
(longer time and higher completion-count). So there is a trend and tuning
direction tracked by the moderator. When the moderator collects sufficient
statistics (also controlled by the consumer defining nr_events), it compares
the current stats with the previous stats, and if a significant change was
observed in the load, the moderator attempts to increment/decrement its
current level (called a step) and schedules a program dispatch work.

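The stepping logic described above can be sketched roughly as follows. This
is a simplified userspace illustration and not the algorithm in lib/irq-am.c:
the struct am_sketch layout, the function names, the use of completions per
window as the only load metric, and the 10% "significant change" threshold
are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum am_trend { AM_STEADY, AM_TOWARD_LATENCY, AM_TOWARD_THROUGHPUT };

/* Hypothetical moderator state for one queue. */
struct am_sketch {
	unsigned int nr_levels;		/* number of levels, sorted */
	unsigned int cur_level;		/* 0 is the latency optimum */
	unsigned int nr_events;		/* events gathered per decision */
	unsigned int events;		/* events in the current window */
	unsigned long comps;		/* completions in current window */
	unsigned long prev_comps;	/* completions in previous window */
};

/* Assumed rule: a change of more than 10% in completions is significant. */
static enum am_trend am_sketch_compare(unsigned long cur, unsigned long prev)
{
	if (cur > prev && (cur - prev) * 10 > prev)
		return AM_TOWARD_THROUGHPUT;
	if (prev > cur && (prev - cur) * 10 > prev)
		return AM_TOWARD_LATENCY;
	return AM_STEADY;
}

/*
 * Account one event carrying 'comps' completions. Returns true when the
 * level changed, i.e. when the consumer would need to reprogram the device
 * (the real code would schedule work for that instead).
 */
static bool am_sketch_event(struct am_sketch *am, unsigned int comps)
{
	unsigned int old = am->cur_level;
	enum am_trend trend;

	assert(am->nr_levels > 0 && am->cur_level < am->nr_levels);
	am->comps += comps;
	if (++am->events < am->nr_events)
		return false;	/* not enough statistics yet */

	trend = am_sketch_compare(am->comps, am->prev_comps);
	if (trend == AM_TOWARD_THROUGHPUT && am->cur_level + 1 < am->nr_levels)
		am->cur_level++;	/* one step up */
	else if (trend == AM_TOWARD_LATENCY && am->cur_level > 0)
		am->cur_level--;	/* one step down */

	/* Start a new window, remembering this one for the next compare. */
	am->prev_comps = am->comps;
	am->comps = 0;
	am->events = 0;
	return am->cur_level != old;
}
```

Taking a single step per decision keeps the moderator from jumping across
levels on one noisy window, at the cost of slower convergence.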
The main reason why this implementation is different than the common
networking device implementations (and kept separate) is that, in my mind at
least, network devices are different animals than other I/O devices in the
sense that:

(a) network devices rely heavily on the byte count of raw ethernet frames for
    adaptive moderation, while in storage or I/O the byte count is often a
    result of a submission/completion transaction and sometimes even known
    only to the application on top of the infrastructure (like in the rdma
    case).
(b) Performance characteristics and expectations in representative workloads.
(c) network devices collect all sorts of stats for different functionalities
    (where adaptive moderation is only one use-case) and I'm not sure at all
    that a subset of stats could easily migrate to a different context.

Having said that, if sufficient reasoning comes along and unification of the
implementations between networking and I/O is desired, I could be convinced
otherwise.

Additionally, I hooked two consumers into this framework:
1. irq-poll - interrupt polling library (used by various rdma consumers and
   can be extended to others)
2. rdma cq - generic rdma cq polling abstraction library

With this, both RDMA initiator mode consumers and RDMA target mode consumers
are able to utilize the framework (nvme, iser, srp, nfs). Moreover, I
currently do not see any reason why other HBAs (or devices in general) that
support interrupt moderation wouldn't be able to hook into this framework as
well.

Note that the code is at *RFC* level, attempting to convey the concept.
If the direction is acceptable, the setup can be easily modified and
cleanups can be performed.

Initial benchmarking shows promising results with a 50% improvement in
throughput when testing a high load of small I/O. The experiment used an
nvme-rdma host against an nvmet-rdma target exposing a null_blk device. The
workload was a multithreaded fio run with high queue-depth and 512B block size
read I/O (4K block size would exceed 100 GbE wire speed). Without adaptive
moderation the result reaches ~8M IOPs, bottlenecking on the host and target
cpu. With adaptive moderation enabled, IOPs quickly converge to ~12M IOPs (at
the expense of slightly higher latencies, obviously) and debugfs stats show
that the moderation level reached the throughput optimum level. There is
currently a known issue I've observed in some conditions when converging back
to latency optimum (after reaching throughput optimum am levels) and I'll work
to fix the tuning algorithm. Thanks to Idan Burstein for running some
benchmarks on his performance setup.

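As a back-of-the-envelope check on the block size choice, the helpers below
compute payload bandwidth from an IOPs rate. They are hypothetical and count
payload bytes only, ignoring all protocol and wire overhead, so the real
limits are somewhat lower than what they return.

```c
#include <assert.h>

/* Payload bandwidth in Gb/s for a given IOPs rate and block size (bytes). */
static double payload_gbps(double iops, double block_bytes)
{
	assert(iops >= 0.0 && block_bytes > 0.0);
	return iops * block_bytes * 8.0 / 1e9;
}

/* Highest IOPs rate whose payload alone fits in a link of link_gbps. */
static double max_iops_at_link(double link_gbps, double block_bytes)
{
	assert(link_gbps >= 0.0 && block_bytes > 0.0);
	return link_gbps * 1e9 / (block_bytes * 8.0);
}
```

With these, 12M IOPs at 512B is roughly 49 Gb/s of payload, while at 4K a
100 Gb/s link saturates at roughly 3M IOPs, well below even the ~8M IOPs
baseline, so the link rather than interrupt handling would be the
bottleneck.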
I've also tested this locally with my single core VMs and saw a similar
improvement of ~50% in throughput in a similar workload (355 KIOPs vs. 235
KIOPs). More testing will help a lot to confirm and improve the
implementation.

QD=1 latency tests showed a marginal regression of up to 2% in latency
(lightly tested though). The reason at this point is that the moderator still
constantly bounces around in the low latency am levels (would like to improve
that).

Another observed issue is the presence of user context polling (IOCB_HIPRI),
which does not update the irq_am stats (mainly because it's not interrupt
driven). This can cause the moderator to do the wrong thing, as it is based on
a partial view of the load (optimizing for latency instead of getting out of
the poller's way). However, recent discussions raised the possibility that
polling requests will be executed on a different set of queues with interrupts
disabled altogether, which would make this a non-issue.

Nonetheless, I would like to get some initial feedback on the approach. Also,
I'm not an expert in tuning the algorithm. The basic approach was inspired by
the mlx5 driver implementation, which seemed the closest fit to the
abstraction level that I was aiming for. So I'd also love to get some ideas on
how to tune the algorithm better for various workloads (hence the RFC).

Lastly, I have also attempted to hook this into nvme (pcie), but that wasn't
successful, mainly because the coalescing set_feature is global to the
controller and not per-queue. I'll be looking to bring per-queue coalescing
to the NVMe TWG (in case the community is interested in supporting this).

Feedback would be highly appreciated, as well as a test drive with the code in
case anyone is interested :)

Sagi Grimberg (5):
  irq-am: Introduce helper library for adaptive moderation implementation
  irq-am: add some debugfs exposure on tuning state
  irq_poll: wire up irq_am and allow to initialize it
  IB/cq: add adaptive moderation support
  IB/cq: wire up adaptive moderation to workqueue based completion queues

 drivers/infiniband/core/cq.c |  73 ++++++++++-
 include/linux/irq-am.h       | 118 ++++++++++++++++++
 include/linux/irq_poll.h     |   9 ++
 include/rdma/ib_verbs.h      |   9 +-
 lib/Kconfig                  |   6 +
 lib/Makefile                 |   1 +
 lib/irq-am.c                 | 291 +++++++++++++++++++++++++++++++++++++++++++
 lib/irq_poll.c               |  30 ++++-
 8 files changed, 529 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/irq-am.h
 create mode 100644 lib/irq-am.c

--
2.14.1