This is an automated email from the ASF dual-hosted git repository.
maciej pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iggy-website.git
The following commit(s) were added to refs/heads/main by this push:
new f2b3eaea Fix grammar and punctuation in blog post
f2b3eaea is described below
commit f2b3eaead3f48fc0b79b33bdb82fd3ce0d361af9
Author: Maciej Modzelewski <[email protected]>
AuthorDate: Fri Feb 27 09:32:32 2026 +0100
Fix grammar and punctuation in blog post
---
content/blog/thread-per-core-io_uring.mdx | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/content/blog/thread-per-core-io_uring.mdx
b/content/blog/thread-per-core-io_uring.mdx
index 0ae93279..b9ed6581 100644
--- a/content/blog/thread-per-core-io_uring.mdx
+++ b/content/blog/thread-per-core-io_uring.mdx
@@ -15,17 +15,17 @@ Apache Iggy utilized `tokio` as its async runtime, which
uses a multi-threaded w
When `tokio` starts, it spins up `N` worker threads (typically one per core)
that continuously execute and reschedule `Futures`. The scheduler decides on
which worker a particular `Future` runs, which can lead to task migrations
between workers, cache invalidations, and less predictable execution paths.
While Rust's `Send` and `Sync` bounds prevent data-race undefined behavior,
they do not prevent higher-level concurrency bugs such as
[deadlocks](https://github.com/apache/iggy/pull/1567).
-But even these challenges weren't what finally tipped us over the edge. The
way `tokio` handles block device I/O was the real dealbreaker. Tokio, following
the poll-based Rust `Futures` model, uses (depending on the platform) a
notification-based mechanism to perform I/O on file descriptors. The runtime
subscribes for a readiness notification for a particular descriptor and
`awaits` the readiness in order to submit the I/O operation. While this works
decently well for network sockets, it [...]
+But even these challenges weren't what finally tipped us over the edge. The
way `tokio` handles block device I/O was the real dealbreaker. Tokio, following
the poll-based Rust `Futures` model, uses (depending on the platform) a
notification-based mechanism to perform I/O on file descriptors. The runtime
subscribes to a readiness notification for a particular descriptor and
`awaits` the readiness in order to submit the I/O operation. While this works
decently well for network sockets, it [...]
## Thread-per-core shared-nothing architecture
The thread-per-core shared-nothing architecture is what we landed on when it
comes to improving the scalability of Apache Iggy. It has been proven
successful by high-performance systems such as
[ScyllaDB](https://github.com/scylladb/scylladb) and
[Redpanda](https://github.com/redpanda-data/redpanda); both projects utilize
the [Seastar](https://github.com/scylladb/seastar) framework to achieve their
performance goals.
-In short, the core philosophy behind this approach is to pin a single thread
to each CPU core, partition your resources based on an heuristic (commonly
hashing), eliminate shared state, thereby [reduce lock contention and improve
cache
locality](https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/)
and finally, use message passing for communication between those threads, also
known as `shards` in `Seastar` terminology. Sounds like a good plan, but as wit
[...]
+In short, the core philosophy behind this approach is to pin a single thread
to each CPU core, partition your resources based on a heuristic (commonly
hashing), eliminate shared state, thereby [reducing lock contention and
improving cache
locality](https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/),
and finally, use message passing for communication between those threads, also
known as `shards` in `Seastar` terminology. Sounds like a good plan, but as
with [...]

From a bird's-eye view, this architecture solves the primary issues of our
previous approach: we move from **work stealing** to **work steering**. That's
a big W, but we were still left with block-device I/O. Using a thread pool for
file operations would ultimately negate the performance gains from core
pinning, so we needed a truly asynchronous I/O interface, and that is how we
discovered `io_uring`.
-There is plethora of materials regarding `io_uring` as it's the hot thing, but
very briefly the interface is straight forward, `io_uring` rather than being a
notification system (readiness based), it's completion based, you submit the
operation and the kernel drives it to completion. The core mechanism revolves
around two lock-free ring buffers shared between user space and the kernel: the
**Submission Queue (SQ)**, where your application enqueues I/O requests, and
the **Completion Queue [...]
+There is a plethora of material regarding `io_uring`, as it's the hot thing,
but very briefly: the interface is straightforward. Rather than being a
notification (readiness-based) system, `io_uring` is completion-based: you
submit the operation and the kernel drives it to completion. The core mechanism
revolves around two lock-free ring buffers shared between user space and the
kernel: the **Submission Queue (SQ)**, where your application enqueues I/O
requests, and the **Completion Queue [...]
## Pick your poison
With all the design pieces in place, it was time to visit the marketplace of
**async runtimes**. We evaluated three candidates:
@@ -42,10 +42,10 @@ Next on the list
[glommio](https://github.com/DataDog/glommio) - this one is par
Finally, [compio](https://github.com/compio-rs/compio) - this is what we ended
up using. It's very similar to `monoio` in terms of architecture, but it stands
out for its broad `io_uring` feature coverage, active maintenance (our patches
got [merged within hours](https://github.com/compio-rs/compio/pull/440)), and
its codebase structure. Unlike `monoio`, the `compio` codebase is structured in
a way where the `driver` is disaggregated from the `executor`, meaning that one
can build their [...]
-Notably, `compio` boxes the I/O request that is submitted to the SQ, which
means that every I/O request incurs a heap allocation, something that `monoio`
avoids. In our case it's not that big of a deal, as those allocations are very
small and `mimalloc` is quite good at maintaining a pool for small, predictable
allocations. We did raise the question in their `Telegram` channel about
whether it would be feasible to use a `Slab` allocator the approach that
`monoio` takes, but the authors d [...]
+Notably, `compio` boxes the I/O request that is submitted to the SQ, which
means that every I/O request incurs a heap allocation, something that `monoio`
avoids. In our case, it's not that big of a deal, as those allocations are very
small and `mimalloc` is quite good at maintaining a pool for small, predictable
allocations. We did raise the question in their `Telegram` channel about
whether it would be feasible to use a `Slab` allocator, the approach that
`monoio` takes, but the authors [...]
## Devil's speech
-Remember how we mentioned that **the devil is in the details** ? Let's give
him mic now.
+Remember how we mentioned that **the devil is in the details**? Let's give him
the mic now.
At first glance, since in the thread-per-core shared-nothing model all state is
local to each shard and anything that requires a **global** view must be
replicated across shards via message passing, it looks like a perfect candidate
for **interior mutability**: replace your `Mutexes` with `RefCells` and run
with the quick win. If you thought that, I've got bad news: you'd be greeted
straight from the ninth circle of Dante's Inferno with:
> thread 'shard-8' (496633) panicked at
> core/server/src/streaming/topics/helpers.rs:298:21:
@@ -55,7 +55,7 @@ Turns out that holding a `RefCell` borrow across an `.await`
point can cause run
The Rust `wg-async` (async working group) seems to be aware of that footgun
and describes it in [this
story](https://rust-lang.github.io/wg-async/vision/submitted_stories/status_quo/barbara_wants_to_use_ghostcell.html).
It *feels* like it should be possible to express statically-checked borrowing
for `Futures` using primitives such as `GhostCell`; they even share a
[proof-of-concept runtime](https://crates.io/crates/stakker) that does exactly
that, but achieving an ergonomic API indistin [...]
-We didn't give up (yet) on interior mutability, rather, we reasoned about the
underlying problem and attempted to solve it with better a API.
+We didn't give up (yet) on interior mutability, rather, we reasoned about the
underlying problem and attempted to solve it with a better API.
The issue is that at `.await` points, the executor can yield the execution
context to another `Future`, and that other `Future` may attempt to borrow the
same `RefCell`, causing a runtime panic since the borrow from the first
`Future` is still active. We ran into this often because our data structures
followed an OOP-style **compile-time hierarchy that matches the domain
model**, which looked akin to that.
@@ -94,14 +94,14 @@ The `save` procedure can be split into two parts
- The mutation of the in-memory state
- The I/O operation using `storage`
- This way our `RefCell` can be much more granular, we use it only for the
in-memory representation of `Stream`, while the storage is stored out of
bounds, but for that we needed a bigger gun, let us introduce **ECS** (Entity
Component System).
+ This way, our `RefCell` can be much more granular: we use it only for the
in-memory representation of `Stream`, while the storage is kept out of band.
But for that, we needed a bigger gun; let us introduce **ECS** (Entity
Component System).
One might be familiar with `ECS` from game engines rather than from message
streaming platforms; personally, I think the general idea behind ECS, **SoA**
(struct of arrays), is fairly underrated.
What we did was split the `Entities` (Streams, Topics, Partitions, etc.) into
their components, where each component is stored in its own dedicated
collection.

-In this case our components are `State` and `Storage`. This allows us to write:
+In this case, our components are `State` and `Storage`. This allows us to
write:
```rs
struct Streams {
@@ -130,9 +130,9 @@ Well, this approach crumbles just as miserably as the
*naive* attempt...
The thread-per-core shared-nothing architecture requires broadcasting events
whenever state changes on one shard. For example, if `shard-0` receives a
`CreateStream` request, once it finishes processing, it broadcasts a
`CreatedStream` event through a channel to all other shards. On the receiving
end, each shard has a background task that polls this channel for incoming
events. The crux of the issue lies in the word **background**.
-
+
-In our `Streams` example, it might not look like a big deal, but in reality
our other `Entities` were much more complicated, without even introducing other
background workers that weren't necessary as part of the thread-per-core shared
nothing architecture. An solution to this problem could be using `async` lock,
but those can be [footguns aswell](https://rfd.shared.oxide.computer/rfd/0400).
+In our `Streams` example, it might not look like a big deal, but in reality
our other `Entities` were much more complicated, without even introducing other
background workers that weren't necessary as part of the thread-per-core
shared-nothing architecture. A solution to this problem could be using `async`
locks, but those can be [footguns as
well](https://rfd.shared.oxide.computer/rfd/0400).
To our surprise, the issue persisted even in scenarios where we enforced a
single-writer principle (we dedicated one shard to become the serialization
point for all requests), which was the final nail in the coffin that led us to
conclude the experiment as a failure. Maintaining non-shared but consistent
state is much more difficult than *just use message passing bro*.
@@ -142,7 +142,7 @@ To our surprise, the issue persisted even in scenarios
where we enforced a singl
After a long fight with `interior mutability`, we gave up on trying to make
fetch happen. Instead, we doubled down on the artifact from the previous
iteration (the single-writer principle). We divided our `resources` into two
groups: shared, strongly consistent resources and sharded, eventually
consistent ones. An example of a sharded resource is `Partition`, while
`Streams` and `Topics` remain shared and strongly consistent. This split later
gave rise to the (Control Plane/Data Plane) naming.
-For shared resources, we decided to use
[`left-right`](https://github.com/jonhoo/left-right), a concurrent data
structure designed for a single writer and multiple readers. It works by
maintaining two pointers to the underlying data: one for readers and one for
the writer. During a writer commit, those pointers are swapped atomically
(greatly simplifying). The single writer is the first shard - `shard0`, while
remaining shards have an `read` handle to the data. In case if an shard other
[...]
+For shared resources, we decided to use
[`left-right`](https://github.com/jonhoo/left-right), a concurrent data
structure designed for a single writer and multiple readers. It works by
maintaining two pointers to the underlying data: one for readers and one for
the writer. During a writer commit, those pointers are swapped atomically
(greatly simplifying). The single writer is the first shard, `shard0`, while
the remaining shards have a `read` handle to the data. In case a shard other t
[...]
As for our partitions, we maintain one shared table (a `DashMap`) called
`shards_table` that functions as a barrier to fence requests that would try to
access a `Partition` that is in the process of creation or deletion. The
requests are still routed to the appropriate shard that contains the
`Partition`, but by consulting the `shards_table` (both during and after
routing), we make sure that eventual consistency does not come back to bite us.
@@ -154,7 +154,7 @@ This design turned out to be a can of worms, or a
bottomless pit, if you prefer.
We can exploit the fact that our `Partition` uses a **segmented log**: the
partition can be sharded even further based on the segment range and knowledge
of which segments are sealed.
-Getting the performance benefits out of `io_uring` itself is a challenge on
it's own (it's not enough to just swap `tokio` with an `io_uring` based
runtime), in order to fully take advantage of the benefits from the `io_uring`
design one have to heavily batch syscalls, as this is the main advantage of
such interface (less context switches, from userspace to kernel space), Rust
`Futures` can be composed together pretty well to facilitate that, but you have
to be careful!
+Getting the performance benefits out of `io_uring` itself is a challenge on
its own (it's not enough to just swap `tokio` for an `io_uring`-based
runtime). To fully take advantage of the `io_uring` design, one has to heavily
batch syscalls, as this is the main advantage of such an interface (fewer
context switches between user space and kernel space). Rust `Futures` can be
composed together pretty well to facilitate that, but you have to be careful!
The following code snippet submits two I/O operations in one "batch", but
`io_uring` does not guarantee that submission order equals completion order!
@@ -176,7 +176,7 @@ To submit a batch while preserving operation order, one
must use the io_uring ch
## The state of Rust async runtimes ecosystem
-The problem is twofold: at the time of writing this blog post, there is no
Rust equivalent of the `Seastar` framework. That is unfortunate, because
`glommio` attempted to be one, but things changed: Glauber moved on to work on
[Turso](https://github.com/tursodatabase/turso), and the Datadog team does not
seem to be actively maintaining the runtime while building [a real-time
time-series storage engine in Rust for performance at
scale](https://www.datadoghq.com/blog/engineering/rust-times [...]
+The problem is twofold: at the time of writing this blog post, there is no
Rust equivalent of the `Seastar` framework. That is unfortunate because
`glommio` attempted to be one, but things changed: Glauber moved on to work on
[Turso](https://github.com/tursodatabase/turso), and the Datadog team does not
seem to be actively maintaining the runtime while building [a real-time
time-series storage engine in Rust for performance at
scale](https://www.datadoghq.com/blog/engineering/rust-timese [...]
Secundo problemo is that these runtimes imitate the `std` library APIs, which
are `POSIX`-compliant, while many of `io_uring`'s most powerful features are
not, leaving those capabilities out of reach for us mere mortals. Request
chaining is only the tip of the iceberg; there is plenty more, for example
`oneshot` APIs for listen/recv, `registered buffers`, and so on. Ultimately,
`File`, `TcpListener`, and `TcpStream` are not the right abstractions. From the
point of view of `POSIX` complia [...]
@@ -296,6 +296,6 @@ Flush the data to disk on every batch write.
| 3,361 MB/s | 1.98 ms | 2.26 ms | 2.57 ms | 3.88 ms |
## Closing words
-Finally, even though we went into significant detail in this blog post, we
have only scratched the surface of what is possible, and several subsections
could easily be blog posts on their own. If you are interested in learning more
about thread-per-core shared-nothing design, check out the `Seastar` framework,
it is the SOTA in this space. For now we shift our attention to the [on-going
work on clustering](https://github.com/apache/iggy/releases/tag/server-0.7.0),
using [Viewstamped Repl [...]
+Finally, even though we went into significant detail in this blog post, we
have only scratched the surface of what is possible, and several subsections
could easily be blog posts on their own. If you are interested in learning more
about thread-per-core shared-nothing design, check out the `Seastar` framework;
it is the SOTA in this space. For now, we shift our attention to the [ongoing
work on clustering](https://github.com/apache/iggy/releases/tag/server-0.7.0),
using [Viewstamped Repl [...]
Stay tuned: a deep-dive blog post on that is coming, and we’re just getting
started 🚀