(hudi) branch asf-site updated: chore(site): add new blogs (#17993)

xushiyan Thu, 22 Jan 2026 22:54:38 -0800

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 1d321167ada1 chore(site): add new blogs (#17993)
1d321167ada1 is described below

commit 1d321167ada19103c709161245f09c468e154690
Author: Shiyan Xu <[email protected]>
AuthorDate: Fri Jan 23 00:54:20 2026 -0600

    chore(site): add new blogs (#17993)
---
 .github/scripts/validate-blog.py                   |   2 +-
 ...2026-01-13-apache-hudi-externalspillablemap.mdx |  14 +++
 ...2026-01-22-apache-hudi-at-applied-intuition.mdx | 102 +++++++++++++++++++++
 ...2026-01-13-apache-hudi-externalspillablemap.png | Bin 0 -> 91552 bytes
 .../img1.png                                       | Bin 0 -> 1747332 bytes
 .../img2.png                                       | Bin 0 -> 243340 bytes
 .../img3.png                                       | Bin 0 -> 93459 bytes
 .../img4.png                                       | Bin 0 -> 209711 bytes
 .../img5.png                                       | Bin 0 -> 123239 bytes
 .../img6.png                                       | Bin 0 -> 109286 bytes
 10 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/.github/scripts/validate-blog.py b/.github/scripts/validate-blog.py
index 6eb5c854e7e4..a99570985499 100644
--- a/.github/scripts/validate-blog.py
+++ b/.github/scripts/validate-blog.py
@@ -31,7 +31,7 @@ ALLOWED_TAGS = {
     'observability', 'metadata', 'meetup', 'key generation', 'docker',
     'cleaner', 'apache hive', 'apache doris', 'vector search', 'upstox',
     'tla specification', 'streamlit', 'rag', 'presto', 'postgres',
-    'file sizing', 'etl', 'databricks', 'data warehouse',
+    'file sizing', 'etl', 'databricks', 'data warehouse', 'applied intuition',
     'conference', 'compaction', 'bootstrap', 'apache parquet', 'announcement',
     'zupee', 'yuno', 'yugabyte', 'yahoo', 'robinhood', 'peloton', 'leboncoin',
     'grofers', 'grab', 'funding circle', 'freewheel', 'estuary', 'alibaba',
diff --git a/website/blog/2026-01-13-apache-hudi-externalspillablemap.mdx 
b/website/blog/2026-01-13-apache-hudi-externalspillablemap.mdx
new file mode 100644
index 000000000000..5ac09d7ef8ea
--- /dev/null
+++ b/website/blog/2026-01-13-apache-hudi-externalspillablemap.mdx
@@ -0,0 +1,14 @@
+---
+title: "ExternalSpillableMap: Handle Maps Too Big for Memory"
+authors:
+- name: Yongkyun
+category: deep-dive
+image: /assets/images/blog/2026-01-13-apache-hudi-externalspillablemap.png
+tags:
+- performance
+- apache spark
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect 
url="https://codepointer.substack.com/p/apache-hudi-externalspillablemap";>Redirecting...
 please wait!! </Redirect>
diff --git a/website/blog/2026-01-22-apache-hudi-at-applied-intuition.mdx 
b/website/blog/2026-01-22-apache-hudi-at-applied-intuition.mdx
new file mode 100644
index 000000000000..1d6431f6e816
--- /dev/null
+++ b/website/blog/2026-01-22-apache-hudi-at-applied-intuition.mdx
@@ -0,0 +1,102 @@
+---
+title: "Scaling Autonomous Vehicle Data Infrastructure with Apache Hudi at 
Applied Intuition"
+excerpt: "How Applied Intuition reduced query times from 10 minutes to under 
25 seconds and achieved 20x storage compression by migrating to an Apache 
Hudi-powered data lakehouse."
+author: The Hudi Community
+category: case-study
+image: /assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img1.png
+tags:
+- data lakehouse
+- applied intuition
+---
+
+---
+
+_This post summarizes Applied Intuition's talk from the Apache Hudi community 
sync. Watch the recording on 
[YouTube](https://www.youtube.com/watch?v=gWw6hOcM-Fg)._
+
+---
+
+![Talk title 
slide](/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img1.png)
+
+Applied Intuition is the foremost enabler of autonomous vehicle (AV) systems, 
providing a suite of tools that help AV companies improve their entire 
stack—from simulation to data exploration. To support their mission, Applied 
Intuition built a unique data infrastructure that is flexible, scalable, and 
secure. After migrating to an Apache Hudi-powered data lakehouse, they 
transformed their data capabilities: query times dropped from 10 minutes to 
under 25 seconds, and they can now query 3 [...]
+
+## Building a Unique Data Infrastructure
+
+Applied Intuition's data infrastructure is designed to meet the specific needs 
of its diverse customer base, including 17 of the top 20 OEMs. Their 
infrastructure is built around four core principles.
+
+First, schemas must be flexible. Each customer determines their own data 
schema, so the infrastructure must handle a wide variety of data points without 
requiring rigid upfront definitions.
+
+Second, compute needs to be tunable. Some customers are more cost-sensitive 
while others have larger-scale needs, so the infrastructure can adjust compute 
resources on a per-customer basis.
+
+Third, everything must be cloud agnostic. Because customers operate on 
different cloud providers, the infrastructure—built on Kubernetes—works 
seamlessly across all of them without relying on a single vendor.
+
+Finally, security and privacy are paramount. All data and infrastructure live 
within the customer's own cloud accounts. This ensures that customers fully own 
and control their data, enabling strict security, privacy, and retention 
policies.
+
+## The Challenges Before Apache Hudi
+
+<img 
src="/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img2.png" 
alt="Architecture before Apache Hudi" width="800"/>
+
+Before adopting Apache Hudi, Applied Intuition's data infrastructure directly 
queried a raw data lake on S3/ABFS using SQL engines. While this approach 
worked initially, significant issues emerged as scale increased.
+
+The system struggled to provide ACID transaction guarantees critical for data 
integrity. Storage costs kept climbing because storing all data in raw format 
was expensive. As small files accumulated, query performance degraded 
dramatically due to the I/O overhead of opening and closing countless files.
+
+To address these challenges, Applied Intuition adopted Apache Hudi, which 
introduced a transactional layer and metadata management to their data 
lake—transforming file system storage into a modern data lakehouse.
+
+Applied Intuition primarily uses 
[Copy-on-Write](https://hudi.apache.org/docs/table_types#copy-on-write-table) 
(COW) tables. While Hudi also offers 
[Merge-on-Read](https://hudi.apache.org/docs/table_types#merge-on-read-table) 
(MOR) tables for faster ingestion, their main priority is query performance. 
With COW tables, they achieve fast query execution while accepting slightly 
higher write latency.
+
+## Leveraging Hudi Features to Shape Data Architecture
+
+Applied Intuition leverages three core Hudi services to optimize its data 
infrastructure: file sizing, clustering, and metadata indexing.
+
+### File Sizing: Solving the Small File Problem
+
+File sizing was the very first reason they started using Hudi. The company 
runs thousands of simulations daily, each generating numerous small files. This 
led to the classic "small file problem"—Spark queries would spend significant 
time just opening and closing files to read metadata. Spark SQL performs best 
with files around 512MB, but simulation files are often just kilobytes.
+
+<img 
src="/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img3.png" 
alt="File sizing optimization" width="800"/>
+
+Hudi's file sizing service efficiently packs small files into optimally-sized 
files by analyzing previous commits and estimating the number of records per 
file. By combining countless kilobyte-sized files, Hudi drastically reduced I/O 
overhead and improved query performance. Their data now takes up 20x less space 
than with raw Parquet files, resulting in substantial S3 cost savings.
+
+Rohit recalls this as the first "aha moment" with Hudi: "We had less than a 
gigabyte of data, but query performance was really slow. When we first tried 
file sizing, performance improved dramatically—and we saw all our data fit 
within megabytes. It was really cool to see that level of compression and query 
performance just out of the box."
+
+### Clustering: Optimizing for Query Patterns
+
+Many of Applied Intuition's queries focus on specific chunks of data, or 
batches. Hudi's [clustering](https://hudi.apache.org/docs/clustering/) feature 
improves data co-location by arranging related records together, minimizing the 
number of files touched per query.
+
+<img 
src="/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img4.png" 
alt="Clustering optimization" width="800"/>
+
+For example, by clustering all data from a single "simulation run ID" into 
just one or two files, Hudi allows queries to avoid scanning thousands of 
files. This has led to massive improvements in query performance. Applied 
Intuition runs clustering jobs asynchronously to maintain low write latency 
while keeping query performance high.
+
+### Metadata Indexing: From Minutes to Seconds
+
+Before Hudi, loading data from raw cloud storage could take minutes, 
especially when listing millions of files. Hudi's [metadata 
indexing](https://hudi.apache.org/docs/metadata_indexing/) creates a file index 
that allows the dataframe to load in under two seconds—a huge UX improvement 
for their customers.
+
+<img 
src="/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img5.png" 
alt="Metadata indexing" width="800"/>
+
+Additionally, they use column stats indices, which store min/max values for 
key columns. When a query runs, Hudi uses these stats to skip irrelevant files 
that don't match the query criteria, enabling much faster lookups.
+
+### Extending Hudi for Schema Flexibility
+
+Given their wide customer base and evolving schema needs, Applied Intuition 
extended Hudi with two customizations: one to evict the cached file schema 
provider so mid-day schema updates are picked up during writes, and another to 
allow Parquet batching even when schemas differ across commits—common in 
simulation data where batches may have different columns.
+
+## Impact: Performance, Cost, and Scale
+
+The improvements with Hudi have been transformative. Applied Intuition can now 
query 3-4 orders of magnitude more data than before. Storage costs dropped 
significantly thanks to file packing that achieves 20x compression compared to 
raw Parquet files. Query times that once took 10 minutes now complete in under 
25 seconds, and dataframe initialization that used to take minutes now happens 
in seconds.
+
+Despite running on tight compute resources—just 1-2 machines running 
DeltaStreamer—their ingestion latency sits around 15 minutes. They can easily 
scale this by adding more compute when needed.
+
+## Next Steps: Moving Beyond PostgreSQL
+
+After successfully implementing Hudi on a few key tables, Applied Intuition is 
scaling Hudi to support their entire data lake architecture.
+
+<img 
src="/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img6.png" 
alt="PostgreSQL CDC architecture" width="800"/>
+
+A proof of concept integrates PostgreSQL CDC via Debezium into Kafka, which 
feeds into Hudi DeltaStreamer, replicating transactional data into a 
Hudi-powered lakehouse. This setup enables non-critical queries to shift away 
from PostgreSQL, reducing database load and improving overall product 
performance. It also opens up deeper analytical insights directly from the data 
lake for both internal teams and customers.
+
+The team worked through some initial setup challenges, resolving tombstone 
record handling through PostgreSQL and Debezium configuration updates.
+
+## Acknowledgments
+
+Applied Intuition is grateful for the incredible support from the Apache Hudi 
community, which has significantly improved their data infrastructure. The 
Onehouse team—Sivabalan, Ethan, and Nadine—has been particularly helpful, 
staying up on long night calls to help debug issues and ensure the team 
understood the product deeply. Nadine also provided ongoing support by 
answering questions on Slack.
+
+## Conclusion
+
+Applied Intuition's journey with Apache Hudi demonstrates how a modern data 
lakehouse platform can solve complex data infrastructure challenges while 
unlocking new levels of performance and insight.
diff --git 
a/website/static/assets/images/blog/2026-01-13-apache-hudi-externalspillablemap.png
 
b/website/static/assets/images/blog/2026-01-13-apache-hudi-externalspillablemap.png
new file mode 100644
index 000000000000..cd5b007d7d90
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-13-apache-hudi-externalspillablemap.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img1.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img1.png
new file mode 100644
index 000000000000..8502285440bf
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img1.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img2.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img2.png
new file mode 100644
index 000000000000..c26e7c127b22
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img2.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img3.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img3.png
new file mode 100644
index 000000000000..4737535bb847
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img3.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img4.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img4.png
new file mode 100644
index 000000000000..8405f58afeaa
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img4.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img5.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img5.png
new file mode 100644
index 000000000000..b276825d3cb3
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img5.png
 differ
diff --git 
a/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img6.png
 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img6.png
new file mode 100644
index 000000000000..0ba0629fe71b
Binary files /dev/null and 
b/website/static/assets/images/blog/2026-01-22-apache-hudi-at-applied-intuition/img6.png
 differ

(hudi) branch asf-site updated: chore(site): add new blogs (#17993)

Reply via email to