This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 81a72bf8f48 [DOCS] Final draft (#12503)
81a72bf8f48 is described below
commit 81a72bf8f4880e95cbe52bb8c690c97907399485
Author: vinoth chandar <[email protected]>
AuthorDate: Tue Dec 17 00:07:21 2024 -0800
[DOCS] Final draft (#12503)
---
...0-0.md => 2024-12-16-announcing-hudi-1-0-0.mdx} | 96 +++++++++++++--------
.../static/assets/images/blog/dlms-hierarchy.png | Bin 0 -> 215165 bytes
.../images/blog/hudi-innovation-timeline.jpg | Bin 0 -> 118407 bytes
3 files changed, 62 insertions(+), 34 deletions(-)
diff --git a/website/blog/2024-12-16-announcing-hudi-1-0-0.md
b/website/blog/2024-12-16-announcing-hudi-1-0-0.mdx
similarity index 97%
rename from website/blog/2024-12-16-announcing-hudi-1-0-0.md
rename to website/blog/2024-12-16-announcing-hudi-1-0-0.mdx
index 8d46173bc55..b06fe52e91c 100644
--- a/website/blog/2024-12-16-announcing-hudi-1-0-0.md
+++ b/website/blog/2024-12-16-announcing-hudi-1-0-0.mdx
@@ -3,7 +3,7 @@ title: "Announcing Apache Hudi 1.0 and the Next Generation of
Data Lakehouses"
excerpt: "game-changing major release, that reimagines Hudi and Data
Lakehouses."
author: Vinoth Chandar
category: blog
-image:
/assets/images/blog/non-blocking-concurrency-control/lsm_archive_timeline.png
+image: /assets/images/blog/dlms-hierarchy.png
tags:
- timeline
- design
@@ -16,19 +16,29 @@ tags:
## Overview
-We are thrilled to announce the release of Apache Hudi 1.0, a landmark
achievement for our vibrant community that defines what the next generation of
data lakehouses should achieve. Hudi pioneered ***transactional data lakes***
in 2017, and today, we live in a world where this technology category is
mainstream as the “***Data Lakehouse”***. The Hudi community has made several
key, original, and first-of-its-kind contributions to this category, as shown
below, compared to when other OSS a [...]
+We are thrilled to announce the release of Apache Hudi 1.0, a landmark
achievement for our vibrant community that defines what the next generation of
data lakehouses should achieve. Hudi pioneered ***transactional data lakes***
in 2017, and today, we live in a world where this technology category is
mainstream as the “***Data Lakehouse”***. The Hudi community has made several
key, original, and first-of-its-kind contributions to this category, as shown
below, compared to when other OSS a [...]
-![][image1]
+<div style={{ textAlign: 'center' }}>
+ <img src="/assets/images/blog/hudi-innovation-timeline.jpg" alt="innovation
timeline" />
+</div>
-This release is more than just a version increment—it advances the breadth of
Hudi’s feature set and its architecture's robustness while bringing fresh
innovation to shape the future. This post reflects on how technology and the
surrounding ecosystem have evolved, making a case for a holistic “***Data
Lakehouse Management System***” (***DLMS***) as the new Northstar. For most of
this post, we will deep dive into the latest capabilities of Hudi 1.0 that make
this evolution possible.
+This [release](/releases/release-1.0.0) is more than just a version
increment—it advances the breadth of Hudi’s feature set and its architecture's
robustness while bringing fresh innovation to shape the future. This post
reflects on how technology and the surrounding ecosystem have evolved, making a
case for a holistic “***Data Lakehouse Management System***” (***DLMS***) as
the new North Star. For most of this post, we will take a deep dive into the latest
capabilities of Hudi 1.0 that make thi [...]
## Evolution of the Data Lakehouse
Technologies must constantly evolve—[Web
3.0](https://en.wikipedia.org/wiki/Web3), [cellular
tech](https://en.wikipedia.org/wiki/List_of_wireless_network_technologies),
[programming language
generations](https://en.wikipedia.org/wiki/Programming_language_generations)—based
on emerging needs. Data lakehouses are no exception. This section explores the
hierarchy of such needs for data lakehouse users. The most basic need is the
“**table format**” functionality, the foundation for data lake [...]
-However, the benefits of a format end there, and now a table format is just
the tip of the iceberg. Users require an [end-to-end open data
lakehouse](https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective),
and modern data lakehouse features need a sophisticated layer of
***open-source software*** operating on data stored in open table formats. For
example, Optimized writers can balance cost and performance by carefully
managing file sizes using the st [...]
+However, the benefits of a format end there, and now a table format is just
the tip of the iceberg. Users require an [end-to-end open data
lakehouse](https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective),
and modern data lakehouse features need a sophisticated layer of
***open-source software*** operating on data stored in open table formats. For
example, optimized writers can balance cost and performance by carefully
managing file sizes using the st [...]
+
+
+<div style={{
+ textAlign: 'center',
+ width: '90%',
+ height: 'auto'
+}}>
+ <img src="/assets/images/blog/dlms-hierarchy.png" alt="dlms hierarchy" />
+</div>
-![][image2]
Moving forward with 1.0, the community has
[debated](https://github.com/apache/hudi/pull/8679) these key points and
concluded that we need more open-source “**software capabilities**” that are
directly comparable with DBMSes for two main reasons.
@@ -42,8 +52,15 @@ If combined, we would gain a powerful database built on top
of the data lake(hou
In Hudi 1.0, we’ve delivered a significant expansion of the data lakehouse
technical capabilities discussed above inside Hudi’s [storage
engine](https://en.wikipedia.org/wiki/Database_engine) layer. Storage engines
(a.k.a. database engines) are standard database components that sit on top of
the storage/file/table format and are wrapped by the DBMS layer above, handling
the core read/write/management functionality. In the figure below, we map the
Hudi components with the seminal [Architectur [...]
-
-<p align = "center">Figure: Apache Hudi Database Architecture</p>
+<div style={{
+ textAlign: 'center',
+ width: '80%',
+ height: 'auto'
+}}>
+ <img src="/assets/images/hudi-stack-1-x.png" alt="Hudi DB Architecture" />
+ <p align = "center">Figure: Apache Hudi Database Architecture</p>
+</div>
+
Regarding full-fledged DLMS functionality, the closest experience Hudi 1.0
offers is through Apache Spark. Users can deploy a Spark server (or Spark
Connect) with Hudi 1.0 installed, submit SQL/jobs, orchestrate table services
via SQL commands, and enjoy new secondary index functionality to speed up
queries like a DBMS. Subsequent releases in the 1.x release line and beyond
will continuously add new features and improve this experience.
@@ -51,15 +68,18 @@ In the following sections, let’s dive into what makes Hudi
1.0 a standout rele
### New Time and Timeline
-For the familiar user, time is a key concept in Hudi. Hudi’s original notion
of time was instantaneous, i.e., actions that modify the table appear to take
effect at a given instant. This was limiting when designing features like
non-blocking concurrency control across writers, which needs to reason about
actions more as an “interval” to detect other conflicting actions. Every action
on the Hudi timeline now gets a *requested* and a *completion* time; Thus, the
timeline layout version has [...]
+For users familiar with Hudi, time is a key concept. Hudi’s original notion
of time was instantaneous, i.e., actions that modify the table appear to take
effect at a given instant. This was limiting when designing features like
non-blocking concurrency control across writers, which needs to reason about
actions more as an “interval” to detect other conflicting actions. Every action
on the Hudi timeline now gets a *requested* and a *completion* time; thus, the
timeline layout version has [...]
+
+<div style={{ textAlign: 'center' }}>
+ <img src="/assets/images/hudi-timeline-actions.png" alt="Timeline actions" />
+ <p align = "center">Figure: Showing actions in Hudi 1.0 modeled as an
interval of two instants: requested and completed</p>
+</div>
-
-<p align = "center">Figure: Showing actions in Hudi 1.0 modeled as an interval
of two instants: requested and completed</p>
Hudi tables are frequently updated, and users also want to retain a more
extended action history associated with the table. Before Hudi 1.0, the older
action history in a table was archived for audit access. However, because cloud
object stores lack append support, accessing that history could become cumbersome
due to the large number of small files. In Hudi 1.0, we have redesigned the timeline as an [LSM
tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree), which is widely
adopted for cases where good w [...]
-In the Hudi 1.0 release, the LSM timeline is heavily used in the query
planning to map requested and completion times across Apache Spark, Apache
Flink and Apache Hive. Future releases plan to leverage this to unify the
timeline's active and history components, providing infinite retention of table
history. Microbenchmarks show that the LSM timeline can be pretty efficient,
even committing every ***30 seconds for 10 years with about 10M instants***,
further cementing Hudi’s table format [...]
+In the Hudi 1.0 release, the [LSM
timeline](/docs/timeline#lsm-timeline-history) is heavily used in the query
planning to map requested and completion times across Apache Spark, Apache
Flink and Apache Hive. Future releases plan to leverage this to unify the
timeline's active and history components, providing infinite retention of table
history. Microbenchmarks show that the LSM timeline is highly efficient, even
when committing every ***30 seconds for 10 years with about 10M instants***
[...]
| Number of actions | Instant batch size | Read cost (times only) | Read cost (with action metadata) | Total file size |
| :---- | :---- | :---- | :---- | :---- |
@@ -67,14 +87,21 @@ In the Hudi 1.0 release, the LSM timeline is heavily used
in the query planning
| 20000 | 10 | 51ms | 188ms | 16.8MB |
| 10000000 | 1000 | 3400ms | 162s | 8.4GB |
-<p align = "center">Figure: Microbenchmark of LSM Timeline</p>
### Secondary Indexing for Faster Lookups
-Indexes are core to Hudi’s design, so much so that even the first
pre-open-source version of Hudi shipped with
[indexes](https://hudi.apache.org/docs/indexes#additional-writer-side-indexes)
to speed up writes. However, these indexes were limited to the writer's side,
except for record indexes in 0.14+ above, which were also integrated with Spark
SQL queries. Hudi 1.0 generalizes indexes closer to the indexing functionality
found in relational databases, supporting indexes on any secondar [...]
+Indexes are core to Hudi’s design, so much so that even the first
pre-open-source version of Hudi shipped with
[indexes](/docs/indexes#additional-writer-side-indexes) to speed up writes.
However, these indexes were limited to the writer's side, except for record
indexes in 0.14 and above, which were also integrated with Spark SQL queries. Hudi
1.0 generalizes indexes closer to the indexing functionality found in
relational databases, supporting indexes on any secondary column across both wr
[...]
+
+<div style={{
+ textAlign: 'center',
+ paddingLeft: '10%',
+ width: '70%',
+ height: 'auto'
+}}>
+ <img src="/assets/images/hudi-stack-indexes.png" alt="Indexes" />
+ <p align = "center">Figure: the indexing subsystem in Hudi 1.0, showing
different types of indexes</p>
+</div>
-
-<p align = "center">Figure: the indexing subsystem in Hudi 1.0, showing
different types of indexes</p>
With secondary indexes, queries and DMLs scan a much-reduced amount of files
from cloud storage, dramatically reducing costs (e.g., on engines like AWS
Athena, which price by data scanned) and improving performance for queries with
low to moderate selectivity. On a benchmark of a query on the *web\_sales*
table (from the ***10 TB tpc-ds dataset***), with 286,603 file groups,
7,198,162,544 total records, and a secondary index column cardinality in the
~1:150 range, w [...]
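
To make this concrete, here is a minimal Spark SQL sketch of creating and using a secondary index. The table and column names are illustrative, and the exact `CREATE INDEX` syntax should be confirmed against the Hudi 1.0 SQL DDL documentation:

```sql
-- Assumes an existing Hudi table `hudi_table` with a non-key column `city`
CREATE INDEX idx_city ON hudi_table USING secondary_index(city);

-- Point lookups on `city` can now prune files via the index
-- instead of scanning the whole table
SELECT * FROM hudi_table WHERE city = 'san_francisco';
```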
@@ -88,7 +115,7 @@ In Hudi 1.0, secondary indexes are only supported for Apache
Spark, with planned
### Bloom Filter indexes
-Bloom filter indexes have existed on the Hudi writers for a long time. It is
one of the most performant and versatile indexes users prefer for
“needle-in-a-haystack” deletes/updates or de-duplication. The index works by
storing special footers in base files around min/max key ranges and a dynamic
bloom filter that adapts to the file size and can automatically handle
partitioning/skew on the writer's path. Hudi 1.0 introduces a newer kind of
bloom filter index for Spark SQL while retainin [...]
+Bloom filter indexes have existed on the Hudi writers for a long time. They are
among the most performant and versatile indexes, preferred by users for
“needle-in-a-haystack” deletes/updates or de-duplication. The index works by
storing special footers in base files around min/max key ranges and a dynamic
bloom filter that adapts to the file size and can automatically handle
partitioning/skew on the writer's path. Hudi 1.0 introduces a newer kind of
bloom filter index for Spark SQL while retainin [...]
```sql
-- Create a bloom filter index on the driver column of the table `hudi_table`
@@ -103,16 +130,19 @@ In future releases of Hudi, we aim to fully integrate the
benefits of the older
An astute reader may have noticed above that the indexing is supported on a
function/expression on a column. Hudi 1.0 introduces expression indexes similar
to
[Postgres](https://www.postgresql.org/docs/current/indexes-expressional.html)
to generalize a two-decade-old relic in the data lake ecosystem:
partitioning! At a high level, partitioning on the data lake divides the table
into folders based on a column or a mapping function (partitioning function).
When queries or operations are [...]
-
-<p align = "center">Figure: Shows index on a date expression when a different
column physically partitions data</p>
-Hudi 1.0 treats partitions as a coarse-grained index on a column value or an
expression of a column, as they should have been. To support the efficiency of
skipping entire storage paths/folders, Hudi 1.0 introduces partition stats
indexes that aggregate these statistics on the storage partition path level, in
addition to doing so at the file level. Now, users can create different types
of indexes on columns to achieve the effects of partitioning in a streamlined
fashion using fewer conce [...]
+<div style={{ textAlign: 'center' }}>
+ <img src="/assets/images/expression-index-date-partitioning.png"
alt="Timeline actions" />
+ <p align = "center">Figure: Shows index on a date expression when a
different column physically partitions data</p>
+</div>
+
+Hudi 1.0 treats partitions as a [coarse-grained
index](/docs/sql_queries#query-using-column-stats-expression-index) on a column
value or an expression of a column, as they should have been. To support the
efficiency of skipping entire storage paths/folders, Hudi 1.0 introduces
partition stats indexes that aggregate these statistics on the storage
partition path level, in addition to doing so at the file level. Now, users can
create different types of indexes on columns to achieve the eff [...]
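
As a sketch, an expression index over a timestamp column might be created as below. The table and column names are hypothetical, and the `OPTIONS` shown follow the expression-index DDL described in Hudi's SQL documentation; verify them against your version before use:

```sql
-- Assumes a Hudi table `hudi_table` with a unix-epoch `ts` column; the index
-- stores column stats on the derived date value rather than the raw column
CREATE INDEX idx_ts_date ON hudi_table
USING column_stats(ts)
OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');

-- Queries filtering on the same expression can skip entire files/folders,
-- giving partition-pruning-like behavior without physical date partitioning
SELECT * FROM hudi_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2024-12-16';
```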
### Efficient Partial Updates
-Managing large-scale datasets often involves making fine-grained changes to
records. Hudi has long supported [partial
updates](https://hudi.apache.org/docs/0.15.0/record_payload#partialupdateavropayload)
to records via the record payload interface. However, this usually comes at
the cost of sacrificing engine-native performance by moving away from specific
objects used by engines to represent rows. As users have embraced Hudi for
incremental SQL pipelines on top of dbt/Spark or Flink Dyn [...]
+Managing large-scale datasets often involves making fine-grained changes to
records. Hudi has long supported [partial
updates](/docs/0.15.0/record_payload#partialupdateavropayload) to records via
the record payload interface. However, this usually comes at the cost of
sacrificing engine-native performance by moving away from specific objects used
by engines to represent rows. As users have embraced Hudi for incremental SQL
pipelines on top of dbt/Spark or Flink Dynamic Tables, there was [...]
-Partial updates improve query and write performance simultaneously by reducing
write amplification for writes and the amount of data read by Merge-on-Read
snapshot queries. It also achieves much better storage utilization due to fewer
bytes stored and improved compute efficiency over existing partial update
support by retaining vectorized engine-native processing. Using the 1TB
Brooklyn benchmark for write performance, we observe about **2.6x** improvement
in Merge-on-Read query performa [...]
+Partial updates improve query and write performance simultaneously by reducing
write amplification and the amount of data read by Merge-on-Read
snapshot queries. They also achieve much better storage utilization due to fewer
bytes stored and improved compute efficiency over existing partial update
support by retaining vectorized engine-native processing. Using the 1TB
Brooklyn benchmark for write performance, we observe about **2.6x** improvement
in Merge-on-Read query performa [...]
| | Full Record Update | Partial Update | Gains |
| :---- | :---- | :---- | :---- |
@@ -120,8 +150,6 @@ Partial updates improve query and write performance
simultaneously by reducing w
| **Bytes written (GB)** | 891.7 | 12.7 | 70.2x |
| **Query latency (s)** | 164 | 29 | 5.7x |
-<p align = "center">Figure: Second benchmark for partial updates, 1TB MOR
table, 1000 partitions, 80% random updates. 3/100 columns randomly updated</p>
-
This also lays the foundation for managing unstructured and multimodal data
inside a Hudi table and supporting [wide
tables](https://github.com/apache/hudi/pull/11733) efficiently for machine
learning use cases.
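
To illustrate the write path described above, a minimal sketch of a partial update via Spark SQL follows. The table and column names are hypothetical; the key point is that the `MERGE` touches only a subset of columns, so Hudi 1.0 can log just the changed values instead of rewriting full records:

```sql
-- Only `fare` (1 of the table's columns) is updated; with partial updates,
-- only the changed column values land in the log files, cutting bytes written
MERGE INTO hudi_table t
USING fare_corrections u
ON t.record_key = u.record_key
WHEN MATCHED THEN UPDATE SET t.fare = u.fare;
```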
### Merge Modes and Custom Mergers
@@ -131,7 +159,7 @@ One of the most unique capabilities Hudi provides is how it
helps process stream

<p align = "center">Figure: Shows EVENT\_TIME\_ORDERING where merging
reconciles state based on the highest event\_time</p>
-Prior Hudi versions supported this functionality through the record payload
interface with built-in support for a pre-combine field on the default
payloads. Hudi 1.0 makes these two styles of processing and merging changes
first class by introducing merge modes within Hudi.
+Prior Hudi versions supported this functionality through the record payload
interface with built-in support for a pre-combine field on the default
payloads. Hudi 1.0 makes these two styles of processing and merging changes
first class by introducing [merge modes](/docs/record_merger) within Hudi.
| Merge Mode | What does it do? |
| :---- | :---- |
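
As a sketch, selecting a merge mode at table creation might look like the following. The table definition is hypothetical, and the merge-mode property name follows Hudi 1.0's record merger documentation; verify it against your version:

```sql
-- Hypothetical table using event-time ordering, so merges keep the record
-- version with the highest event_time regardless of arrival order
CREATE TABLE events (
  event_id STRING,
  payload STRING,
  event_time BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'event_id',
  preCombineField = 'event_time',
  'hoodie.record.merge.mode' = 'EVENT_TIME_ORDERING'
);
```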
@@ -147,10 +175,10 @@ We have expressed dissatisfaction with the optimistic
concurrency control approa
Hudi 1.0 introduces a new **non-blocking concurrency control (NBCC)** designed
explicitly for data lakehouse workloads, drawing on years of experience in the
Hudi community supporting some of the largest data lakes on the planet.
NBCC enables simultaneous writing from multiple writers and compaction of the
same record without blocking any involved processes. This is achieved with simple,
lightweight distributed locks and the TrueTime semantics discussed above (see
[RFC-66](https://github.com [...]
-
-<p align = "center">
-Figure: Two streaming jobs in action writing to the same records concurrently
on different columns.
-</p>
+<div style={{ textAlign: 'center' }}>
+ <img src="/assets/images/nbcc_partial_updates.gif" alt="NBCC" />
+ <p align = "center">Figure: Two streaming jobs in action writing to the same
records concurrently on different columns.</p>
+</div>
NBCC operates with streaming semantics, tying together concepts from previous
sections. Data necessary to compute table updates are emitted from an upstream
source, and changes and partial updates can be merged in any of the merge modes
above. For example, in the figure above, two independent Flink jobs enrich
different table columns in parallel, a pervasive pattern seen in stream
processing use cases. Check out this
[blog](https://hudi.apache.org/blog/2024/12/06/non-blocking-concurrency [...]
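
For readers wanting to try this, a minimal sketch of a table that multiple streaming writers can update concurrently under NBCC is shown below. The column names are hypothetical, and the concurrency-mode config name and value follow Hudi's concurrency control documentation; verify them for your release before use:

```sql
-- MOR table configured for non-blocking concurrency control, allowing
-- concurrent writers (e.g., two Flink jobs enriching different columns)
-- to proceed without blocking each other
CREATE TABLE orders (
  order_id STRING,
  amount DOUBLE,
  status STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'order_id',
  preCombineField = 'ts',
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
```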
@@ -161,13 +189,13 @@ If you are wondering: “All of this sounds cool, but how
do I upgrade?” we ha

<p align = "center">Figure: 4-step process for painless rolling upgrades to
Hudi 1.0</p>
-Hudi 1.0 introduces backward-compatible writing to achieve this in 4 steps, as
described above. Hudi 1.0 also automatically handles any checkpoint translation
necessary as we switch to completion time-based processing semantics for
incremental and CDC queries. The Hudi metadata table has to be temporarily
disabled during this upgrade process but can be turned on once the upgrade is
completed successfully. Please read the [release
notes](https://hudi.apache.org/releases/release-1.0.0) car [...]
+Hudi 1.0 introduces backward-compatible writing to achieve this in 4 steps, as
described above. Hudi 1.0 also automatically handles any checkpoint translation
necessary as we switch to completion time-based processing semantics for
incremental and CDC queries. The Hudi metadata table has to be temporarily
disabled during this upgrade process but can be turned on once the upgrade is
completed successfully. Please read the [release
notes](/releases/release-1.0.0) carefully to plan your migration.
## What’s Next?
Hudi 1.0 is a testament to the power of open-source collaboration. This
release embodies the contributions of 60+ developers, maintainers, and users
who have actively shaped its roadmap. We sincerely thank the Apache Hudi
community for their passion, feedback, and unwavering support.
-The release of Hudi 1.0 is just the beginning. Our current
[roadmap](https://hudi.apache.org/roadmap/) includes exciting developments
across the following planned releases:
+The release of Hudi 1.0 is just the beginning. Our current [roadmap](/roadmap)
includes exciting developments across the following planned releases:
* **1.0.1**: First bug-fix patch release on top of 1.0, which hardens the
functionality above and makes it easier to adopt. We intend to publish additional patch
releases to aid migration to 1.0 as the bridge release for the community from
0.x.
* **1.1**: Faster writer code path rewrite, new indexes like bitmap/vector
search, granular record-level change encoding, Hudi storage engine APIs,
abstractions for cross-format interop.
@@ -180,9 +208,9 @@ Hudi releases are drafted collaboratively by the community.
If you don’t see s
Are you ready to experience the future of data lakehouses? Here’s how you can
dive into Hudi 1.0:
-* Documentation: Explore Hudi’s
[Documentation](https://hudi.apache.org/docs/overview) and learn the
[concepts](https://hudi.apache.org/docs/hudi_stack).
-* Quickstart Guide: Follow the [Quickstart
Guide](https://hudi.apache.org/docs/quick-start-guide) to set up your first
Hudi project.
-* Upgrading from a previous version? Follow the [migration
guide](https://hudi.apache.org/releases/release-1.0.0#migration-guide) and
contact the Hudi OSS community for help.
+* Documentation: Explore Hudi’s [Documentation](/docs/overview) and learn the
[concepts](/docs/hudi_stack).
+* Quickstart Guide: Follow the [Quickstart Guide](/docs/quick-start-guide) to
set up your first Hudi project.
+* Upgrading from a previous version? Follow the [migration
guide](/releases/release-1.0.0#migration-guide) and contact the Hudi OSS
community for help.
* Join the Community: Participate in discussions on the [Hudi Mailing
List](https://hudi.apache.org/community/get-involved/),
[Slack](https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g)
and [GitHub](https://github.com/apache/hudi/issues).
* Follow us on social media:
[Linkedin](https://www.linkedin.com/company/apache-hudi/?viewAsMember=true),
[X/Twitter](https://twitter.com/ApacheHudi).
diff --git a/website/static/assets/images/blog/dlms-hierarchy.png
b/website/static/assets/images/blog/dlms-hierarchy.png
new file mode 100644
index 00000000000..ee8f22afe0c
Binary files /dev/null and
b/website/static/assets/images/blog/dlms-hierarchy.png differ
diff --git a/website/static/assets/images/blog/hudi-innovation-timeline.jpg
b/website/static/assets/images/blog/hudi-innovation-timeline.jpg
new file mode 100644
index 00000000000..4bd114adfe1
Binary files /dev/null and
b/website/static/assets/images/blog/hudi-innovation-timeline.jpg differ