fresh-borzoni commented on code in PR #2934: URL: https://github.com/apache/fluss/pull/2934#discussion_r2996308568
########## website/blog/2026-03-25-fluss-rust-sdk.md: ########## @@ -0,0 +1,149 @@ +--- +slug: fluss-rust-sdk +title: "Why Apache Fluss Chose Rust for Its Multi-Language SDK" +authors: [yuxia, keithlee, anton] +image: ./assets/fluss_rust/banner.jpg +--- + + + +If you maintain a data system that only speaks Java, you will eventually hear from someone who doesn't. A Python team building a feature store. A C++ service that needs sub-millisecond writes. An AI agent that wants to call your system through a tool binding. They all need the same capabilities (writes, reads, lookups) and none of them want to spin up a JVM to get them. + +Apache Fluss, streaming storage for real-time analytics and AI, hit this exact inflection point. The [Java client](/blog/fluss-java-client) works well for Flink-based compute, where the JVM is already the world you live in. But outside that world, asking consumers to run a JVM sidecar just to write a record or look up a key creates friction that compounds across every service, every pipeline, every agent in the stack. + +We could have written a separate client for each language. Maintain five copies of the wire protocol, five implementations of the batching logic, five sets of retry semantics and idempotence tracking. That path scales linearly with languages and ends predictably: the Java client gets features first, the Python client gets them six months later with slightly different edge-case behavior, and the C++ client is perpetually "almost done." + +We took a different path and tried to learn from those who had already solved this problem. + +<!-- truncate --> + +## The librdkafka Model + +If you've worked with Kafka clients outside of Java, you've probably used [librdkafka](https://github.com/confluentinc/librdkafka) without knowing it. It's a single C library that powers `confluent-kafka-python`, `confluent-kafka-go`, and others. One core handles the wire protocol, batching, memory management, and delivery semantics.
Each language binding is a thin wrapper, glue on top of a battle-tested engine. + +The model is elegant because it inverts the usual maintenance equation. Instead of N full client implementations that diverge over time, each developing its own bugs, its own subtle behavioral differences, its own backlog of features the Java client has but the Python client doesn't yet, you get one implementation and N thin bindings that stay in sync by construction. A bug gets fixed once, and every language picks it up on the next build. + +The deeper benefit is correctness, not just code reuse. When you maintain three separate implementations of a client protocol, behavioral drift is inevitable: edge cases in retry logic, subtle differences in how backpressure kicks in, inconsistencies in how idempotent writes handle sequence numbers. These are the bugs that don't show up in unit tests but surface in production under load, and they surface differently in each language. + +We built fluss-rust on this same idea. A single Rust core implements the full Fluss client protocol (Protobuf-based RPC, record batching with backpressure, background I/O, Arrow serialization, idempotent writes, SASL authentication) and exposes it to three languages: + +- **Rust**: directly, as the `fluss-rs` crate +- **Python**: via [PyO3](https://pyo3.rs), the Rust-Python bridge +- **C++**: via [CXX](https://cxx.rs), the Rust-C++ bridge + +To give a sense of proportion: the Rust core is roughly 40k lines, while the Python binding is around 5k and the C++ binding around 6k. The bindings handle type conversion, async runtime bridging, and memory ownership at the language boundary, but all the protocol logic, batching, Arrow codec, and retry handling live in the shared core. + +## Why Rust and Not C + +C would have been the obvious choice. librdkafka already proves the model works at enormous scale, and the C ABI is the universal language of foreign function interfaces.
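To make the C-ABI point concrete, here is a toy sketch of what "universal" means in practice: a function exported with the C calling convention, callable from any language with a C FFI once compiled into a shared library. The function name `fluss_toy_add` is invented purely for illustration and is not part of fluss-rust (which, as described below, binds through PyO3 and CXX rather than a hand-written C header); note that choosing Rust does not give this option up.

```rust
// Toy example of exporting a symbol over the C ABI. `no_mangle` keeps the
// symbol name stable and `extern "C"` uses the C calling convention, so only
// C-representable types should cross this boundary.
#[no_mangle]
pub extern "C" fn fluss_toy_add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    // A C caller linking against the compiled cdylib would declare:
    //   int32_t fluss_toy_add(int32_t a, int32_t b);
    println!("{}", fluss_toy_add(40, 2)); // prints 42
}
```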
+ +We chose Rust, and the reason is specific rather than philosophical: compile-time safety is a force multiplier for a small team maintaining a shared core that multiple languages depend on. + +Getting memory safety right in C means manual lifetime tracking, careful code review, and years of experience knowing where the subtle bugs hide. Rust gives the same zero-overhead profile (no garbage collector, no runtime) but checks those invariants at compile time instead. + +To make this concrete: the Fluss write path moves ownership from the caller through a concurrent map, into a background event loop, and back out to futures the caller may or may not still be holding. In C, getting that right is a matter of discipline, and getting it wrong means segfaults that only reproduce under production load. In Rust, the borrow checker and the `Send`/`Sync` traits catch those problems before the code ever runs. + +We're not alone here. Polars, Apache OpenDAL, and delta-rs all chose Rust as a shared core with language bindings on top. Fluss's Rust SDK sits in that lineage. + +## Relationship with the Java Client + +The Java client remains the primary integration point for Flink, powering the SQL connector, the DataStream API, and the tightest path for JVM-based streaming compute. If your workload is Flink reading from and writing to Fluss, that's still the right client to use. + +fluss-rust isn't trying to replace it. It exists for the consumers the Java client was never designed to serve: Python pipelines, C++ services, Rust applications, and anything else that doesn't want a JVM in the process. Both clients talk the same wire protocol and get the same server-side behavior. Most teams will end up using both, the Java client for Flink and fluss-rust for everything around it. + +## What the Rust Core Covers + +The Rust core implements the complete Fluss client protocol. Here's how the pieces fit together. 
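As a concrete anchor before the walkthrough, here is a self-contained sketch of the producer pattern the write path follows: a synchronous enqueue into a batch queue, a background sender task, and a result handle the caller can either await or drop. Every name here (`Pending`, `WriteResultFuture`, `run_demo`) is an illustrative std-only stand-in, not the actual `fluss-rs` API, and a plain thread plus channels stands in for the real async runtime and RPC layer.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for a queued record: the payload plus a one-shot
// channel the background sender uses to acknowledge the write.
struct Pending {
    record: String,
    ack: mpsc::Sender<Result<(), String>>,
}

// Stand-in for a write-result future: hold it and block for the ack, or
// drop it for fire-and-forget -- the background task still ships the record.
struct WriteResultFuture(mpsc::Receiver<Result<(), String>>);

impl WriteResultFuture {
    fn wait(self) -> Result<(), String> {
        self.0.recv().map_err(|e| e.to_string())?
    }
}

fn run_demo() -> Vec<String> {
    // The channel stands in for the per-bucket batch queue; the spawned
    // thread stands in for the background sender task.
    let (batch_tx, batch_rx) = mpsc::channel::<Pending>();
    let sender = thread::spawn(move || {
        let mut shipped = Vec::new();
        for pending in batch_rx {
            // In the real client this would be an RPC to a TabletServer.
            // If the caller dropped its future, the ack send fails
            // harmlessly -- the write itself still happened.
            let _ = pending.ack.send(Ok(()));
            shipped.push(pending.record);
        }
        shipped
    });

    // "append" is synchronous: it only enqueues, no network I/O.
    let append = |record: &str| -> WriteResultFuture {
        let (ack_tx, ack_rx) = mpsc::channel();
        batch_tx
            .send(Pending { record: record.to_string(), ack: ack_tx })
            .expect("sender task alive");
        WriteResultFuture(ack_rx)
    };

    let awaited = append("order-1");
    let _ = append("order-2"); // fire-and-forget: drop the future immediately

    awaited.wait().expect("acknowledged");
    drop(batch_tx); // close the queue so the sender drains and exits
    sender.join().expect("sender task panicked")
}

fn main() {
    println!("shipped: {:?}", run_demo());
}
```

Both records are shipped regardless of whether their futures are held, which mirrors the point made below: dropping the future skips the client-side wait, not the write itself.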
+ +When you write a record, the call is synchronous: the record gets queued into a per-bucket batch without touching the network. A background sender task picks up ready batches and ships them as RPCs to the responsible TabletServers. This follows the same pattern as both the Fluss Java client and Kafka producers. + +The caller gets back a `WriteResultFuture`. Await it to block until the server confirms, or drop it for fire-and-forget. Either way, the server acknowledges the write with acks=all by default, so dropping the future skips the client-side wait, not the durability guarantee. Review Comment: well, it's on the caller and how they track and advance the source, but I added better wording -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
