GitHub user wuchong edited a discussion: Apache Fluss Roadmap 2026

This roadmap outlines the strategic direction and key focus areas for the 
Apache Fluss community in 2026. It highlights both ongoing initiatives and 
planned features designed to solidify Fluss as the **Streaming Storage for 
Real-Time Analytics & AI**.

> **Disclaimer**: This document reflects our current vision and priorities. It 
> is not a strict delivery commitment. We adapt our plans based on community 
> feedback and the evolving data landscape.

---

### 📢 We Need Your Voice

If you are running Fluss in production or evaluating it for your next-gen 
architecture, your feedback is critical. Please help us prioritize by letting 
us know:

* **Blockers:** What is missing that prevents you from deploying Fluss today?
* **Priorities:** Which feature below would provide the most immediate value to 
your team?
* **AI & Lakehouse:** Which specific integration would you most like to see?

---

### 🧠 Pillar 1: Real-Time AI & ML

*Transforming Fluss into the backbone for real-time AI & ML.*

- [ ] **Real-Time Feature Store**: Native support for feature engineering 
primitives, including:
  - [ ] **Aggregation Merge Engines** for real-time and long-period features
  - [ ] **Zero-Cost Schema Evolution** for safe ML model iteration
  - [ ] **Point-in-Time Correctness** for training data
- [ ] **Multimodal Streaming Data**: Support for row-based, column-based, vector, 
variant, and image data, among other formats.
- [ ] **Python AI Ecosystem**: A high-performance Python SDK designed for AI/ML 
workflows, enabling seamless integration with PyTorch, Ray, Pandas, and PyArrow 
for direct access to both real-time streams and historical snapshots.
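
To illustrate what point-in-time correctness means for training data, here is a minimal sketch in plain Python. It is not Fluss code or a Fluss API: the function names and the sample data are hypothetical. The idea is that each label event is joined only with the feature value that was already known at the event's timestamp, so no future information leaks into the training set.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_ts):
    """Return the latest feature value whose timestamp is <= event_ts.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no value existed yet at event_ts.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_ts)
    return feature_history[idx - 1][1] if idx > 0 else None


def build_training_rows(label_events, feature_history):
    """Attach to each (timestamp, label) event the feature value as of that time."""
    return [
        (ts, point_in_time_lookup(feature_history, ts), label)
        for ts, label in label_events
    ]


# Hypothetical data: a user's 7-day purchase count as it changed over time.
purchases_7d = [(100, 1), (200, 3), (300, 4)]
labels = [(50, 0), (200, 1), (250, 0)]
training_rows = build_training_rows(labels, purchases_7d)
```

The label at timestamp 250 is paired with the value written at 200, not the later value written at 300, and the label at 50 gets no feature value because none existed yet.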

### 🌊 Pillar 2: The Real-Time Lakehouse

*Bridging the gap between Streaming and the Lakehouse with sub-second latency.*

- [ ] **Union Read Ecosystem**: Native Union Read support for Spark, Trino, 
StarRocks, and other engines.
- [ ] **Performance**: Implementation of **Deletion Vectors** to significantly 
accelerate Union Reads for updates and deletes.
- [ ] **Seamless Data Lake**: Capability to define Fluss tables directly on top 
of existing Lakehouse tables. 
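
As a rough mental model of how a deletion vector can speed up a union read, consider the conceptual sketch below. It is an assumption-laden illustration in plain Python, not the Fluss implementation: `union_read`, the row layout, and the sample data are all hypothetical. A deletion vector is modeled as a set of row positions in a lake file that are already known to be stale, so the reader can skip them by position instead of comparing keys.

```python
def union_read(snapshot_rows, deletion_vector, log_changes):
    """Combine a lake snapshot with newer changes from the streaming log.

    snapshot_rows: list of (key, value) rows as stored in a lake file.
    deletion_vector: set of row positions in snapshot_rows that are stale.
    log_changes: list of (key, value) changes newer than the snapshot,
        in log order; a value of None marks a delete.
    """
    # Skip stale snapshot rows purely by position.
    result = {
        key: value
        for pos, (key, value) in enumerate(snapshot_rows)
        if pos not in deletion_vector
    }
    # Apply the fresh changes from the log on top of the snapshot.
    for key, value in log_changes:
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


# Hypothetical data: position 1 ("b") was superseded after the snapshot.
snapshot = [("a", 1), ("b", 2), ("c", 3)]
stale_positions = {1}
changes = [("b", 20), ("c", None), ("d", 4)]
merged = union_read(snapshot, stale_positions, changes)
```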

### 🚀 Pillar 3: Streaming Analytics (Flink & Spark)

*Deep integration to provide a "Single Engine Experience" for unified batch and 
stream processing.*

- [ ] **Global Secondary Index**: Enabling fast lookups on any non-primary-key 
column.
- [ ] **Delta Join**: General Availability of Delta Joins, powered by new 
Secondary Index capabilities.
- [ ] **Real-Time Profiles**: Native support for **RoaringBitmap** aggregation 
and auto-increment keys to power User/Item Profile use cases.
- [ ] **Optimizer**: Moving from rule-based optimization (RBO) to cost-based 
optimization (CBO) in Flink SQL by leveraging Fluss table statistics.
- [ ] **Spark Engine**: Full Spark Engine support, including Catalog, Streaming 
Read/Write, and Batch Read/Write.
- [ ] **Spark RTM**: Integration with Spark 4.1 Structured Streaming Real-Time 
Mode.
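
To show why bitmap aggregation and auto-increment keys pair well for profile use cases, here is a small conceptual sketch. It is not Fluss or RoaringBitmap code: it uses a plain Python integer as an uncompressed bitmap, and `IdDictionary`, the tag names, and the user names are hypothetical. Dense integer IDs let each user occupy one bit, so audience counts and overlaps reduce to bitwise operations.

```python
class IdDictionary:
    """Assigns dense integer IDs to string keys, like an auto-increment column."""

    def __init__(self):
        self._ids = {}

    def id_for(self, key):
        # A new ID is assigned only the first time a key is seen.
        return self._ids.setdefault(key, len(self._ids))


def add_to_bitmap(bitmap, int_id):
    """Set the bit for int_id; adding the same ID twice changes nothing."""
    return bitmap | (1 << int_id)


def cardinality(bitmap):
    """Count the set bits, i.e. the number of distinct IDs in the bitmap."""
    return bin(bitmap).count("1")


ids = IdDictionary()
tag_bitmaps = {}
events = [("sports", "alice"), ("sports", "bob"), ("music", "bob"), ("sports", "alice")]
for tag, user in events:
    tag_bitmaps[tag] = add_to_bitmap(tag_bitmaps.get(tag, 0), ids.id_for(user))

sports_users = cardinality(tag_bitmaps["sports"])
sports_and_music = cardinality(tag_bitmaps["sports"] & tag_bitmaps["music"])
```

A compressed bitmap format serves the same role at scale; the plain integer here only keeps the sketch dependency-free.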

### 💾 Pillar 4: Storage Engine

- [ ] **Columnar Streaming**: Comprehensive support for Filter Pushdown (#197) 
and Aggregation Pushdown to minimize data transfer.
- [ ] **Schema Evolution**: Full support for adding new columns, altering 
existing columns, and dropping columns.
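
The effect of filter pushdown on data transfer can be sketched with a toy scan function. This is a simplified illustration in plain Python, not the Fluss client API: `scan`, the table contents, and the column names are hypothetical. The "server" applies the predicate and the column projection before returning rows, so fewer rows and fewer columns cross the network.

```python
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 30},
    {"order_id": 2, "region": "US", "amount": 50},
    {"order_id": 3, "region": "EU", "amount": 20},
    {"order_id": 4, "region": "APAC", "amount": 70},
]


def scan(rows, columns, predicate=None):
    """Simulated server-side scan: filter first, then project the requested columns."""
    selected = rows if predicate is None else [r for r in rows if predicate(r)]
    return [{c: r[c] for c in columns} for r in selected]


# Without pushdown: every row and column is transferred, then filtered client-side.
transferred_all = scan(ROWS, ["order_id", "region", "amount"])
eu_total_client = sum(r["amount"] for r in transferred_all if r["region"] == "EU")

# With pushdown: only matching rows and the one needed column are transferred.
transferred_pushed = scan(ROWS, ["amount"], predicate=lambda r: r["region"] == "EU")
eu_total_pushed = sum(r["amount"] for r in transferred_pushed)
```

Both paths compute the same total; the pushed-down scan simply transfers half the rows and a third of the columns in this toy data set. Aggregation pushdown would go one step further and return only the sum.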


### ☁️ Pillar 5: Cloud-Native Architecture

*Simplifying operations and reducing TCO.*

- [ ] **ZooKeeper Removal**: Eliminating the ZK dependency to simplify 
deployment and reduce operational complexity.
- [ ] **Zero Disks**: "Direct write to S3" to make storage nodes diskless and 
elastic, and to eliminate inter-zone networking costs.


### 🔌 Pillar 6: Connectivity & Ingestion

*Making it easier to get data in and out of Fluss.*

- [ ] **Log Ingestion Ecosystem**: Direct integration with popular agents such as 
Logstash, Vector, Filebeat, and Fluentd, or Kafka compatibility so that 
existing log collectors can write to Fluss without code changes.
- [ ] **Client SDK Expansion**: Rust Client, C++ Client, Python Client.


### 🛠️ Pillar 7: Operational Excellence

*Features to make Day-2 operations seamless.*

- [ ] **Cluster Rebalance**: Automated tools to rebalance data distribution 
across nodes.
- [ ] **Rescale Tables**: Bucket rescaling for Log Tables and Primary Key 
Tables.
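
One simple way to think about automated rebalancing is a greedy plan that repeatedly moves a bucket from the most loaded server to the least loaded one. The sketch below is only an illustration under that assumption, not the planned Fluss algorithm: `rebalance`, the server names, and the bucket IDs are hypothetical, and load is measured by bucket count alone.

```python
def rebalance(assignment):
    """Return (new_assignment, moves) with bucket counts differing by at most one.

    assignment: dict mapping a server name to its list of bucket IDs.
    moves: list of (bucket, from_server, to_server) in the order planned.
    """
    plan = {server: list(buckets) for server, buckets in assignment.items()}
    moves = []
    while True:
        busiest = max(plan, key=lambda s: len(plan[s]))
        idlest = min(plan, key=lambda s: len(plan[s]))
        if len(plan[busiest]) - len(plan[idlest]) <= 1:
            return plan, moves
        bucket = plan[busiest].pop()
        plan[idlest].append(bucket)
        moves.append((bucket, busiest, idlest))


# Hypothetical cluster state: one server holds almost all buckets.
current = {"server-1": [0, 1, 2, 3, 4], "server-2": [5], "server-3": []}
balanced, planned_moves = rebalance(current)
```

A production planner would also weigh bucket size, traffic, and the cost of each move, which this sketch ignores.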

GitHub link: https://github.com/apache/fluss/discussions/2342

