wuchong commented on code in PR #1868: URL: https://github.com/apache/fluss/pull/1868#discussion_r2484368584
##########
website/blog/releases/0.8.md:
##########
@@ -0,0 +1,192 @@
+---
+title: "Apache Fluss 0.8: Streaming Lakehouse with Iceberg/Lance"
+sidebar_label: "Announcing Apache Fluss 0.8"
+authors: [giannis, jark]
+date: 2025-10-30
+tags: [releases]
+---
+
+
+
+🌊 We are excited to announce the official release of **Fluss 0.8**!
+
+This is the first ASF release for Apache Fluss (incubating), marking a significant milestone in our journey to provide a robust streaming storage platform for real-time analytics.
+Over the past four months, we’ve delivered lots of improvements and new capabilities, with more than 390+ commits, across the Streaming Lakehouse ecosystem,
+including: deeper integration with Apache Flink, extensive improvements in the Streaming Lakehouse with support for [Apache Iceberg](https://iceberg.apache.org/) and [Lance](https://github.com/lancedb/lance),
+and the introduction of [Delta Joins](https://cwiki.apache.org/confluence/display/FLINK/FLIP-486%3A+Introduce+A+New+DeltaJoin), which redefine efficiency in stream processing.
+
+Apache Fluss 0.8 marks a new era of **real-time**, **unified**, and **zero-state streaming**, designed to power the next generation of data platforms, focusing on performance, scalability, and simplicity of the overall architecture.
+
+<!-- truncate -->
+
+
+
+## Streaming Lakehouse for Iceberg
+
+A key highlight of Fluss 0.8 is the introduction of **Streaming Lakehouse for Apache Iceberg** ([FIP-3](https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg)),
+which transforms Iceberg from a batch-oriented table format into a continuously updating Lakehouse. Apache Fluss acts as the **real-time ingestion and storage layer**, writing fresh data and updates into Iceberg with guaranteed ordering and exactly-once semantics.
+
+This enables real-time data on Fluss to be tiered as Apache Iceberg tables, while providing table semantics like partitioning and bucketing on a single copy of data.
+Moreover, it solves Iceberg’s long-standing update limitations through Fluss’s **native support for upserts and deletes** and its **built-in compaction service**,
+which automatically merges small files and maintains optimized Iceberg snapshots.
+
+Key benefits include:
+- **Unified Architecture**: Fluss handles sub-second streaming reads and writes, while Iceberg stores compacted historical data.
+- **Native Updates and Deletes**: Fluss efficiently applies changes and tiers them into Iceberg without rewrite jobs.
+- **Built-in Compaction Service**: The built-in service maintains snapshot efficiency with no external tooling.
+- **Efficient Backfilling**: Enables lightning-fast backfill of historical data from Iceberg for streaming processing.
+- **Lower Cost**: Reduce storage cost by tiering cold data to Iceberg while keeping hot data in Fluss, eliminating the need for duplicate storage.
+- **Lower Latency**: Sub-second data freshness for Iceberg tables by Union Read from Fluss and Iceberg.
+
+```yaml title='server.yaml'
+# Iceberg configuration
+datalake.format: iceberg
+
+# the catalog config about Iceberg, assuming using Hadoop catalog,
+datalake.iceberg.type: hadoop
+datalake.iceberg.warehouse: /path/to/iceberg
+```
+
+You can find more detailed instructions in the [documentation](/docs/next/streaming-lakehouse/integrate-data-lakes/iceberg/).
+
+## Real-Time Multimodal AI Analytics with Lance
+
+Another major enhancement in Fluss 0.8 is the addition of **Streaming Lakehouse support for [Lance](https://github.com/lancedb/lance)** ([FIP-5](https://cwiki.apache.org/confluence/display/FLUSS/FIP-5%3A+Support+tiering+Fluss+data+to+Lance),
+a modern columnar and vector-native data format designed for AI and machine learning workloads.
+This integration extends Apache Fluss towards being a real-time ingestion platform for multi-modal data & AI,
+not just traditional tabular streams, but also embeddings, vectors, and unstructured features used in AI systems.
+With this release, Fluss can continuously ingest, update, and tier data into Lance tables with guaranteed ordering and freshness,
+enabling fast synchronization between streaming pipelines and downstream ML or retrieval applications.
+
+Key benefits include:
+
+- **Unified multi-modal data ingestion**: Stream tabular, vector, and embedding data into Lance in real time.
+- **AI/ML-ready storage**: Keep feature vectors and embeddings continuously up-to-date for model training or inference.
+- **Low-latency analytics and retrieval**: Fast, continuous updates enable Lance data to be immediately usable for real-time search and recommendation.
+- **Simplified architecture**: Eliminates complex ETL pipelines between streaming systems and vector databases.
+
+Seamless integration: combines Fluss’s high-throughput streaming engine with Lance’s efficient columnar persistence for consistent, multi-modal data management.
+
+```yaml title='server.yaml'
+datalake.format: lance
+datalake.lance.warehouse: s3://<bucket>
+datalake.lance.endpoint: <endpoint>
+datalake.lance.allow_http: true
+datalake.lance.access_key_id: <access_key_id>
+datalake.lance.secret_access_key: <secret_access_key>
+```
+
+See the [LanceDB blog post](https://lancedb.com/blog/fluss-integration/) for the full integration. You also can find more detailed instructions in the [documentation](/docs/next/streaming-lakehouse/integrate-data-lakes/lance/).
+
+## Flink 2.1
+
+Apache Fluss is now fully compatible with **Apache Flink 2.1**, ensuring seamless integration with the latest Flink runtime and APIs.
+This update strengthens Fluss’s role as a unified streaming storage layer, providing reliable performance and consistency for modern data pipelines built on Flink.
+
+### Delta Join

Review Comment:
Since this is listed under `Flink 2.1`, it is implicitly an integration feature with Flink. I’ll keep the title as is for conciseness.
We’ve already noted, `This release introduces support for Delta Joins with Apache Flink`, so the context should be clear.
