From: Van Haaren, Harry <harry.van.haa...@intel.com>
Sent: Friday, May 2, 2025 9:58 AM
To: Etelson, Gregory <getel...@nvidia.com>; Richardson, Bruce 
<bruce.richard...@intel.com>
Cc: dev@dpdk.org <dev@dpdk.org>; Owen Hilyard <owen.hily...@unh.edu>
Subject: Re: [PATCH] rust: RFC/demo of safe API for Dpdk Eal, Eth and Rxq

> From: Etelson, Gregory
> Sent: Friday, May 02, 2025 1:46 PM
> To: Richardson, Bruce
> Cc: Gregory Etelson; Van Haaren, Harry; dev@dpdk.org; owen.hily...@unh.edu
> Subject: Re: [PATCH] rust: RFC/demo of safe API for Dpdk Eal, Eth and Rxq
>
> Hello Bruce,

Hi All,

> > Thanks for sharing. However, IMHO using EAL for thread management in rust
> > is the wrong interface to expose.
>
> EAL is a singleton object in DPDK architecture.
> I see it as a hub for other resources.

Yep, I tend to agree here; EAL is central to the rest of DPDK working correctly.
And given that the EAL implementation relies heavily on global static variables,
it certainly is a "singleton" instance, yes.
I think a singleton is one way to implement this, but then you lose some of the
RAII/automatic resource management behavior. It would, however, make some APIs
inherently unsafe or very unergonomic unless we were to force rte_eal_cleanup
to be run via atexit(3) or the platform equivalent and forbid the user from
running it themselves. For a lot of Rust runtimes similar to the EAL (Tokio,
Glommio, etc.), once you spawn a runtime it's around until process exit. The
other option is to have a handle which represents the state of the EAL on the
Rust side, runs rte_eal_init on creation, and runs rte_eal_cleanup on
destruction. There are two ways we can make that safe. The first is reference
counting: once handles are created they can be passed around easily, and the
last one runs rte_eal_cleanup when it gets dropped. This avoids having tons of
complicated lifetimes, and I think that everywhere it wouldn't affect fast-path
performance we should use refcounting. The other option is to use lifetimes.
This is doable, but it is going to force people who are more likely to
primarily be C or C++ developers to dive deep into Rust's type system if they
want to build abstractions over it. If we add async into the mix, as many
people are going to want to do, it becomes much, much harder. As a result, I'd
advocate for only using lifetimes for data-path components where refcounting
isn't an option.
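
To make the refcounted-handle idea concrete, here's a minimal sketch. It
assumes raw FFI bindings to the real rte_eal_init/rte_eal_cleanup symbols live
in a hypothetical dpdk_sys module; this is not the proposed dpdk-rs API, just
an illustration of "last handle dropped runs cleanup":

    use std::os::raw::{c_char, c_int};
    use std::sync::Arc;

    mod dpdk_sys {
        // Hypothetical hand-written bindings; signatures mirror the C API.
        use super::*;
        extern "C" {
            pub fn rte_eal_init(argc: c_int, argv: *mut *mut c_char) -> c_int;
            pub fn rte_eal_cleanup() -> c_int;
        }
    }

    struct EalInner;

    impl Drop for EalInner {
        fn drop(&mut self) {
            // Runs exactly once, when the last handle is dropped.
            unsafe { dpdk_sys::rte_eal_cleanup() };
        }
    }

    // Cheap to clone; clones bump the refcount instead of re-initializing.
    #[derive(Clone)]
    pub struct Eal(Arc<EalInner>);

    impl Eal {
        pub fn init() -> Result<Self, c_int> {
            // Empty argv for brevity; a real wrapper would forward EAL args.
            let ret = unsafe { dpdk_sys::rte_eal_init(0, std::ptr::null_mut()) };
            if ret < 0 { Err(ret) } else { Ok(Eal(Arc::new(EalInner))) }
        }
    }

Other resources (ports, mempools, ...) could then hold a clone of Eal, so they
can never outlive EAL initialization.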

> Following that idea, the EAL structure can be divided to hold the
> "original" resources inherited from librte_eal and new resources
> introduced in Rust EAL.

Here we can look from different perspectives. Should "Rust EAL" even exist?
If so, why? The DPDK C APIs were designed in baremetal/Linux days, when certain
"best practices" didn't exist yet and the Rust language was pre-1.0.

Of course, certain parts of the Rust API must depend on EAL being initialized.
There is a logical flow to DPDK initialization, and that ordering must be kept
for correct functionality.

I guess I'm saying, perhaps we can do better than mirroring the concept of
"DPDK EAL in C" in to "DPDK EAL in Rust".

I think that there will need to be some kind of runtime exposed by the library.
A lot of the existing EAL abstractions may need to be reworked, especially
those dealing with memory, but I think a lot of things can be layered on top of
the C API. However, I think many of the invariants in the EAL could be enforced
at compile time for free, which may mean creating a lot of "unchecked" function
variants that skip over null checks and other validation.
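
As a sketch of what that checked/unchecked split could look like (mirroring
std's get/get_unchecked convention; the Port type and method names below are
purely illustrative, not real bindings):

    pub struct Port;

    impl Port {
        /// Validating variant: fine for setup or slow-path code.
        pub fn enable_promiscuous(&self, port_id: u16) -> Result<(), &'static str> {
            if !self.port_exists(port_id) {
                return Err("no such port");
            }
            // Safety: existence was checked above.
            unsafe { self.enable_promiscuous_unchecked(port_id) };
            Ok(())
        }

        /// # Safety
        /// Caller must guarantee `port_id` refers to a valid, configured port;
        /// validation is skipped so hot paths pay nothing for it.
        pub unsafe fn enable_promiscuous_unchecked(&self, _port_id: u16) {
            // ... call straight into the C API ...
        }

        fn port_exists(&self, _port_id: u16) -> bool {
            true // placeholder
        }
    }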

As was mentioned before, it may also make sense for some abstractions in the C 
EAL to be lifted to compile time. I've spent a lot of time thinking about how 
to use something like Rust's traits for "it just works" capabilities where you 
can declare what features you want (ex: scatter/gather) and it will either be 
done in hardware or fall back to software, since you were going to need to do 
it anyway. This might lead to parameterizing a lot of user code on the devices 
they expect to interact with and then having some "dyn EthDev" as a fallback, 
which should be roughly equivalent to what we have now. I can explain that in 
more detail if there's interest.
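
For illustration, here's a rough sketch of that trait idea; all names here
(EthDev, ScatterGather, SoftSg, Mbuf) are hypothetical placeholders, not
existing dpdk-rs items:

    /// Placeholder for a real mbuf type.
    pub struct Mbuf;

    pub trait EthDev {
        fn rx_burst(&mut self, out: &mut Vec<Mbuf>) -> usize;
    }

    /// Capability marker: a device that delivers reassembled frames, either
    /// because the NIC offload is enabled or via a software shim.
    pub trait ScatterGather: EthDev {}

    /// Software fallback that wraps any device and does reassembly itself.
    pub struct SoftSg<D: EthDev>(pub D);

    impl<D: EthDev> EthDev for SoftSg<D> {
        fn rx_burst(&mut self, out: &mut Vec<Mbuf>) -> usize {
            // ... reassemble multi-segment mbufs here ...
            self.0.rx_burst(out)
        }
    }
    impl<D: EthDev> ScatterGather for SoftSg<D> {}

    // User code is generic over the capability it needs; a trait object
    // (Box<dyn ScatterGather>) remains the runtime-dispatch fallback.
    fn rx_loop<D: ScatterGather>(dev: &mut D) {
        let mut burst = Vec::with_capacity(32);
        loop {
            let n = dev.rx_burst(&mut burst);
            if n == 0 { break; }
            // ... process packets ...
            burst.clear();
        }
    }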

> > Instead, I believe we should be
> > encouraging native rust thread management, and not exposing any DPDK
> > threading APIs except those necessary to have rust threads work with DPDK,
> > i.e. with an lcore ID. Many years ago when DPDK started, and in the C
> > world, having DPDK as a runtime environment made sense, but times have
> > changed and for Rust, there is a whole ecosystem out there already that we
> > need to "play nice with", so having Rust (not DPDK) do all thread
> > management is the way to go (again IMHO).
> >
>
> I'm not sure what exposed DPDK API you refer to.

I think that's the point :) Perhaps the Rust application should decide how/when
to create threads, and how to schedule & pin them; not the "DPDK crate for Rust".
To give a more concrete example, let's look at Tokio (or Monoio, or Glommio, or
...), which are prominent players in the Rust ecosystem, particularly for
networking workloads where request/response patterns are well served by the
"async" programming model (e.g. an HTTP server).
Rust doesn't really care about threads that much. Yes, it has std::thread as a
pthread equivalent, but on Linux those literally call pthread. Enforcing the
correctness of the Send and Sync traits (responsible for helping enforce thread
safety) in APIs is left to library authors. I've used Rust with EAL threads and
it works fine, although a slightly nicer API for launching based on a closure
(which is, under the hood, a function pointer plus a struct holding the
captured inputs) would be welcome. In Rust, I'd say that async and threads are
orthogonal concepts, except where runtimes force them to mix. Async is a way to
write a state machine or (with some more abstraction) an execution graph, and
Rust the language doesn't care whether a library decides to run some
dependencies in parallel. What I think Rust users are more likely to want is
thread-per-core, with either a single async runtime spanning all of those
threads or an async runtime per core.
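
As an aside, that closure-launch convenience is just a small wrapper: box the
closure and hand a trampoline to the C launch call. A minimal sketch, assuming
a hand-written binding to the real rte_eal_remote_launch symbol (the ffi
module here is hypothetical):

    use std::os::raw::{c_int, c_void};

    mod ffi {
        use super::*;
        // Hypothetical binding; the C signature is
        // int rte_eal_remote_launch(lcore_function_t *f, void *arg, unsigned worker_id).
        extern "C" {
            pub fn rte_eal_remote_launch(
                f: unsafe extern "C" fn(*mut c_void) -> c_int,
                arg: *mut c_void,
                worker_id: u32,
            ) -> c_int;
        }
    }

    // Trampoline with the lcore_function_t shape: unbox and run the closure.
    unsafe extern "C" fn trampoline<F: FnOnce() -> i32>(arg: *mut c_void) -> c_int {
        let closure = Box::from_raw(arg as *mut F);
        closure() as c_int
    }

    /// Launch a closure (captures and all) on the given worker lcore.
    pub fn launch_on_lcore<F>(lcore: u32, f: F) -> c_int
    where
        F: FnOnce() -> i32 + Send + 'static,
    {
        let boxed = Box::into_raw(Box::new(f)) as *mut c_void;
        unsafe { ffi::rte_eal_remote_launch(trampoline::<F>, boxed, lcore) }
    }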

Let's focus on Tokio first: it is an "async runtime" (two links for future
readers)
    <snip>
So an async runtime can run "async" Rust functions (called Futures, or Tasks
when run independently).
There are lots of words/concepts, but I'll focus only on the thread
creation/control aspect, given the DPDK EAL lcore context.

Tokio is a work-stealing scheduler. It spawns "worker" threads, and then gives 
these "tasks"
to various worker cores (similar to how Golang does its work-stealing 
scheduling). Some
DPDK crate users might like this type of workflow, where e.g. RXQ polling is a 
task, and the
"tokio runtime" figures out which worker to run it on. "Spawning" a task causes 
the "Future"
to start executing. (technical Rust note: notice the "Send" bound on Future: 
https://docs.rs/tokio/latest/tokio/task/fn.spawn.html )
The work-stealing aspect of Tokio has also led to some issues in the Rust
ecosystem. What it effectively means is that every "await" is a place where you
might get moved to another thread. This means it would be unsound to, for
example, hold a queue handle for a device without MT-safe queues unless we want
to put a mutex on top of all of the device queues. I personally think this is
much of the reason people find Rust async hard: Tokio forces you to be
thread-safe at really weird places in your code, and has issues like not being
able to hold a (std) mutex guard across an await point.
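
To make that concrete, here's a small sketch. RxQueue is a made-up stand-in
for a non-MT-safe queue handle (a raw-pointer field is used purely to make the
type !Send), and it assumes the tokio crate with the "rt" feature:

    use std::marker::PhantomData;

    /// Stand-in for a handle to a single, non-MT-safe RX queue.
    pub struct RxQueue {
        _not_send: PhantomData<*mut ()>, // raw pointers make the type !Send/!Sync
    }

    async fn poll(_q: &mut RxQueue) { /* rx_burst + yield, in a real loop */ }

    fn main() {
        let rt = tokio::runtime::Builder::new_current_thread()
            .build()
            .unwrap();
        let local = tokio::task::LocalSet::new();

        local.block_on(&rt, async {
            let mut q = RxQueue { _not_send: PhantomData };

            // OK: spawn_local never moves the task to another thread.
            tokio::task::spawn_local(async move { poll(&mut q).await });

            // Does NOT compile: tokio::spawn requires the future to be Send,
            // because any .await may resume on a different worker thread.
            // tokio::spawn(async move { poll(&mut q).await });
        });
    }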

Other users might prefer the "thread-per-core" and CPU-pinning approach (like
DPDK itself would do).
Nit: Tokio also spawns a thread per core; it just freely moves tasks between
cores. It doesn't pin because it's designed to interoperate more nicely with
the normal kernel scheduler. I think not needing pinned cores is nice, but we
want the ability to pin for performance reasons, especially on NUMA/NUCA
systems (NUCA = Non-Uniform Cache Architecture: almost every AMD EPYC above 8
cores, higher-core-count Intel Xeons for the last 3 generations, etc.).
Monoio and Glommio both serve these use cases (but in slightly different 
ways!). They both spawn threads and do CPU pinning.
Monoio and Glommio say "tasks will always remain on the local thread". In Rust 
techie terms: "Futures are !Send and !Sync"
    https://docs.rs/monoio/latest/monoio/fn.spawn.html
    https://docs.rs/glommio/latest/glommio/fn.spawn_local.html
There is also another option, one which would eliminate "service cores": we
provide both a work-stealing pool for tasks that can tolerate being yanked
between cores/EAL threads at any time but aren't data-plane tasks, and a
different API for spawning tasks onto the local thread/core for data-plane
tasks (e.g. something to manage a particular HTTP connection). This might make
writing the runtime harder, but it should provide the best of both worlds,
provided we can build in a feature (Rust provides a way to "ifdef out" code via
features) to disable one or the other if someone doesn't want the overhead.
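
A rough sketch of what that two-spawn API surface might look like (DpdkRuntime
and both method names are hypothetical, not an existing crate):

    use std::future::Future;

    pub struct DpdkRuntime;

    impl DpdkRuntime {
        /// Control-plane / housekeeping work: may be stolen by any worker
        /// lcore, so the future must be Send.
        pub fn spawn_stealing<F>(&self, _fut: F)
        where
            F: Future<Output = ()> + Send + 'static,
        {
            todo!("push onto the shared work-stealing queue")
        }

        /// Data-plane work: stays on the current lcore, so !Send state such
        /// as a per-queue handle can live inside the task.
        pub fn spawn_local<F>(&self, _fut: F)
        where
            F: Future<Output = ()> + 'static,
        {
            todo!("push onto this lcore's local run queue")
        }
    }

The Send bound on spawn_stealing and its absence on spawn_local is what lets
the compiler enforce "this task may migrate, that one may not".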

So there are at least 3 different async runtimes (and I haven't even talked 
about async-std, smol, embassy, ...) which
all have different use-cases, and methods of running "tasks" on threads. These 
runtimes exist, and are widely used,
and applications make use of their thread-scheduling capabilities.

So "async runtimes" do thread creation (and optionally CPU pinning) for the 
user.
Other libraries like "Rayon" are thread-pool managers; those also have various
CPU thread-create/pinning capabilities.
If DPDK *also* wants to do thread creation/management and CPU-thread-to-core 
pinning for the user, that creates tension.
The other problem is that most of these async runtimes have IO very tightly 
integrated into them. A large portion of Tokio had to be forked and rewritten 
for io_uring support, and DPDK is a rather stark departure from what they were 
all designed for. I know that both Tokio and Glommio have "start a new async 
runtime on this thread" functions, and I think that Tokio has an "add this 
thread to a multithreaded runtime" somewhere.

I think the main thing DPDK would need to be concerned about is that many of
these runtimes use thread-locals, and I'm not sure whether that would be
transparently handled by the EAL thread runtime; I've always used
thread-per-core and then used the Rust runtime to multiplex between tasks,
instead of spawning more EAL threads.

Rayon should probably be thought of in a similar vein to OpenMP, since it's 
mainly designed for batch processing. Unless someone is doing some fairly heavy 
computation (the kind where "do we want a GPU to accelerate this?" becomes a 
question) inside of their DPDK application, I'm having trouble thinking of a 
use case that would want both DPDK and Rayon.

> Bruce wrote: "so having Rust (not DPDK) do all thread management is the way 
> to go (again IMHO)."

I think I agree here: in order to make the Rust DPDK crate usable from the Rust
ecosystem, it must align itself with the existing Rust networking ecosystem.

That means the DPDK Rust crate should not FORCE the usage of lcore pinnings and
mappings. Allowing a Rust application to decide how to best handle threading
(via Rayon, Tokio, Monoio, etc.) will allow much more "native" or "ergonomic"
integration of DPDK into Rust applications.
I'm not sure that using DPDK from Rust will be possible without either serious
performance sacrifices or rewrites of a lot of the networking libraries. Tokio
continues to mimic the BSD sockets API for IO, even with the io_uring version,
as does Glommio. The idea of "recv" giving you a buffer, rather than you
passing one in, isn't really used outside of some lower-level io_uring crates.
At a bare minimum, even if DPDK managed to offer an API that works exactly the
same way as io_uring or epoll, we would still need to go to all of the async
runtimes and get them to plumb DPDK support in, or approve someone from the
DPDK community maintaining that support. If we don't offer that API, then we
either need rewrites inside the async runtimes or individual libraries
providing DPDK support, which is going to be even more difficult.
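
To illustrate the buffer-ownership mismatch (std::net is used for the socket
side; Mbuf and the rx_burst closure are made-up stand-ins for the DPDK side):

    use std::io::Read;
    use std::net::TcpStream;

    /// Placeholder for a driver-allocated packet buffer.
    pub struct Mbuf;

    // Socket-style IO: the caller owns the buffer, the kernel copies into it.
    fn socket_style(stream: &mut TcpStream) -> std::io::Result<usize> {
        let mut buf = [0u8; 2048];
        stream.read(&mut buf)
    }

    // DPDK-style IO: the driver hands back buffers it allocated from a
    // mempool; the caller takes ownership and must eventually free them.
    fn dpdk_style(rx_burst: impl Fn(&mut Vec<Mbuf>) -> usize) -> usize {
        let mut pkts: Vec<Mbuf> = Vec::with_capacity(32);
        rx_burst(&mut pkts)
    }

Most of the async ecosystem is written against the first shape, which is why
plumbing DPDK underneath it isn't just a driver swap.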

I agree that forcing lcore pinnings and mappings isn't good, but I think that 
DPDK is well within its rights to build its own async runtime which exposes a 
standard API. For one thing, the first thing Rust users will ask for is a TCP 
stack, which the community has been discussing and debating for a long time. I 
think we should figure out whether the goal is to allow DPDK applications to be 
written in Rust, or to allow generic Rust applications to use DPDK. The former 
means that the audience would likely be Rust-fluent people who would have used 
DPDK regardless, and are fine dealing with mempools, mbufs, the EAL, and ethdev
configuration. The latter is a much larger audience, which is likely going to
be less tolerant of dpdk-rs exposing the true complexity of using DPDK. Yes, Rust
can help make the abstractions better, but there's an amount of inherent 
complexity in "Your NIC can handle IPSec for you and can also direct all IPv6 
traffic to one core" that I don't think we can remove.

I personally think the right approach is to make an API for DPDK applications
to be written in Rust, and then steadily add abstractions on top of that until
we arrive at something that someone who has never looked at a TCP header can
use without too much confusion. That was part of the goal of the Iris project I
pitched (and then had to go finish another project, so the design is still
WIP). I think that a move to DPDK is going to be as radical a change as a move
to io_uring; however, DPDK is fast enough that I think it may be possible to
convince people to do a rewrite once we arrive at that high-level API. "Swap
out your sockets and rework the functions that do network IO for a 5x
performance increase" is a very, very attractive offer, but for us to get there
I think we need to have DPDK's full potential available in Rust, and then build
as many zero-overhead (zero-cost, or you couldn't write it better yourself)
abstractions as we can on top. I want to avoid a situation where we build up to
the high-level APIs as fast as we can and then end up with "Easy Mode" and
"C DPDK written in Rust" as the only two options.
> Regards,
> Gregory

Apologies for the long-form, "wall of text" email, but I hope it captures the
nuance of threading and async runtimes, which I believe will in the long term
be very useful for capturing "async offload" use-cases for DPDK. To put it
another way, lookaside processing can be hidden behind async functions &
runtimes if we design the APIs right: and that would be really cool for making
async-offload code easy to write correctly!

Regards, -Harry

Sorry for my own walls of text. As a consequence of working on Iris I've spent
a lot of time thinking about how to make DPDK easier to use while keeping the
performance intact, and I was already thinking in Rust since it provides one of
the better options for these kinds of abstractions (the other option I see is
Mojo, which isn't ready yet). I want to see DPDK become more accessible, but
performance and access to hardware are among the main things that make DPDK
special, so I don't want to compromise them. I definitely agree that we need to
force DPDK's existing APIs to justify themselves in the face of the new
capabilities of Rust, but I think that starting from "How are Rust applications
written today?" is a mistake.

Regards,
Owen
