(hudi) branch asf-site updated: [BLOG] Modernizing Data Infrastructure at Peloton Using Apache Hudi (#13561)

xushiyan Wed, 16 Jul 2025 07:58:24 -0700

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 2d8f2abe5c22 [BLOG] Modernizing Data Infrastructure at Peloton Using 
Apache Hudi (#13561)
2d8f2abe5c22 is described below

commit 2d8f2abe5c22803bc8954de34cffa8f6c5cdf41c
Author: Dipankar Mazumdar <[email protected]>
AuthorDate: Wed Jul 16 10:56:19 2025 -0400

    [BLOG] Modernizing Data Infrastructure at Peloton Using Apache Hudi (#13561)
---
 ...025-07-15-modernizing-datainfra-peloton-hudi.md | 168 +++++++++++++++++++++
 .../pel_fig1.png                                   | Bin 0 -> 467010 bytes
 .../pel_fig2.png                                   | Bin 0 -> 443684 bytes
 .../pel_fig3.png                                   | Bin 0 -> 641606 bytes
 .../pel_fig4.png                                   | Bin 0 -> 616117 bytes
 .../pel_fig5.png                                   | Bin 0 -> 349514 bytes
 .../pel_fig6.png                                   | Bin 0 -> 396177 bytes
 .../pel_fig7.png                                   | Bin 0 -> 738333 bytes
 .../peloton-1200x600.jpg                           | Bin 0 -> 1015296 bytes
 9 files changed, 168 insertions(+)

diff --git a/website/blog/2025-07-15-modernizing-datainfra-peloton-hudi.md 
b/website/blog/2025-07-15-modernizing-datainfra-peloton-hudi.md
new file mode 100644
index 000000000000..a39d5fd478cd
--- /dev/null
+++ b/website/blog/2025-07-15-modernizing-datainfra-peloton-hudi.md
@@ -0,0 +1,168 @@
+---
+title: "Modernizing Data Infrastructure at Peloton Using Apache Hudi"
+excerpt: "How Peloton's Data Platform team scaled their data infrastructure 
using Hudi"
+author: Amaresh Bingumalla, Thinh Kenny Vu, Gabriel Wang, Arun Vasudevan in 
collaboration with Dipankar Mazumdar
+category: blog
+image: 
/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/peloton-1200x600.jpg
+tags:
+- Apache Hudi
+- Peloton
+- Community
+---
+
+:::tip TL;DR
+
+Peloton re-architected its data platform using Apache Hudi to overcome 
snapshot delays, rigid service coupling, and high operational costs. By 
adopting CDC-based ingestion from PostgreSQL and DynamoDB, moving from CoW to 
MoR tables, and leveraging asynchronous services with fine-grained schema 
control, Peloton achieved 10-minute ingestion cycles, reduced compute/storage 
overhead, and enabled time travel and GDPR compliance.
+
+:::
+
+
+Peloton is a global interactive fitness platform that delivers connected, 
instructor-led fitness experiences to millions of members worldwide. Known for 
its immersive classes and cutting-edge equipment, Peloton combines software, 
hardware, and data to create personalized workout journeys. With a growing 
member base and increasing product diversity, data has become central to how 
Peloton delivers value. The *Data Platform* team at Peloton is responsible for 
building and maintaining the co [...]
+
+
+## The Challenge: Data Growth, Latency, and Operational Bottlenecks 
+
+As Peloton evolved into a global interactive fitness platform, its data 
infrastructure was challenged by the growing need for timely insights, agile 
service migrations, and cost-effective analytics. Daily operations, 
recommendation systems, and compliance requirements demanded an architecture 
that could support near real-time access, high-frequency updates, and scalable 
service boundaries.
+
+However, the team faced persistent bottlenecks with the existing setup:
+
+* Reporting pipelines were gated by the completion of full snapshot jobs.  
+* Recommender systems could only function on daily refreshed datasets.  
+* The analytics platform was tightly coupled with operational systems.  
+* Microservice migrations were constrained to all-at-once shifts.  
+* Database read replicas incurred high infrastructure costs.
+
+These limitations made it difficult to meet SLA expectations, scale workloads 
efficiently, and adapt the platform to new user and product needs.
+
+## The Legacy Architecture
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig1.png"
 alt="challenge" width="1000" align="middle"/>
+
+Peloton's earlier architecture relied on daily snapshots from a monolithic 
**PostgreSQL** database. The analytics systems would consume these snapshots, 
often waiting hours for completion. This not only delayed reporting but also 
introduced downstream rigidity.
+
+Because the same data platform supported both online and analytical workloads, 
any schema or service migration required significant planning and coordination. 
Database read replicas, used to scale reads, increased cost overhead. Moreover, 
recommendation systems that depended on data freshness were constrained by the 
snapshot interval, limiting personalization capabilities. This architecture 
struggled to support a fast-moving product roadmap, near real-time analytics, 
and the data agility [...]
+
+## Reimagining the Data Platform with Apache Hudi
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig2.png"
 alt="challenge" width="1000" align="middle"/>
+
+To address these challenges, the data platform team introduced Apache Hudi as 
the foundation of its modern data lake. The architecture was rebuilt to support 
Change Data Capture (CDC) ingestion from both PostgreSQL and DynamoDB using 
Debezium, with Kafka acting as the transport layer. A custom-built Hudi writer 
was developed to ingest CDC records into S3 using Apache Spark on EMR (version 
6.12.0 with Hudi 0.13.1).
+
+Peloton initially chose Copy-on-Write (CoW) table formats to support querying 
via Redshift Spectrum and simplify adoption. However, performance and cost 
bottlenecks prompted a transition to Merge-on-Read (MoR) tables with 
asynchronous table services for cleaning and compaction.
+
+Key architectural enhancements included:
+
+* **Support for GDPR compliance** through structured delete propagation.  
+* **Time travel queries** for recommender model training and data recovery.  
+* **Phased migration support** for microservices via decoupled ingestion.
+
+Peloton's broader data platform tech stack supports this architecture with a 
range of tools for orchestration, analytics, and governance. This includes EMR 
for compute, Redshift for querying, DBT for data transformations, Looker for BI 
and visualization, Airflow for orchestration, and DataHub for metadata 
management. These components complement Apache Hudi in forming a modular and 
production-ready lakehouse stack.
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig3.png"
 alt="challenge" width="1000" align="middle"/>
+
+## Learnings from Running Hudi at Scale
+
+With Hudi now integrated into Peloton's data lake, the team began to observe 
and address new operational and architectural challenges that emerged at scale. 
This section outlines the major lessons learned while maintaining 
high-ingestion throughput, ensuring data reliability, and keeping 
infrastructure costs under control.
+
+### CoW vs MoR: Performance Trade-offs
+
+Initially, Copy-on-Write (CoW) tables were chosen to simplify deployment and 
ensure compatibility with Redshift Spectrum. However, as ingestion frequency 
increased and update volumes spanned hundreds of partitions, performance became 
a bottleneck. Some high-frequency tables with updates across 256 partitions 
took nearly an hour to process per run. Additionally, retaining 30 days of 
commits for training recommender models significantly inflated storage 
requirements, reaching into the hund [...]
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig4.png"
 alt="challenge" width="1000" align="middle"/>
+
+To resolve this, the team migrated to Hudi’s Merge-on-Read (MoR) tables and 
reduced commit retention to 7 days. With ingestion jobs now running every 10 
minutes, latency dropped significantly, and storage and compute usage became 
more efficient.
+
+### Async vs Inline Table Services
+
+To improve write throughput and meet low-latency ingestion goals, the Peloton 
team initially configured Apache Hudi with asynchronous cleaner and compactor 
services. This approach worked well across most tables, allowing ingestion 
pipelines to run every 10 minutes with minimal blocking but introduced some 
operational edge cases. Some of the challenges encountered included:
+
+* Concurrent execution of writer and cleaner jobs, leading to conflicts. These 
were mitigated by introducing DynamoDB-based locks to serialize access.  
+* Reader-cleaner race conditions, where time travel queries intermittently 
failed with `"File Not Found"` errors \- traced back to cleaners deleting files 
mid-read.  
+* Compaction disruptions caused by EMR node terminations, which led to 
orphaned files when jobs failed mid-way.
+
+These edge cases were largely due to the operational complexity of managing 
concurrent workloads at Peloton’s scale. After weighing reliability against 
latency, the team opted to switch to inline table services for compaction and 
cleaning, augmented with custom logic to control when these actions would run. 
This change improved system stability while maintaining acceptable latency 
trade-offs.
+
+### Glue Schema Version Limits
+
+As schema evolution continued, the team used Hudi's `META_SYNC_ENABLED` to 
sync schema updates with AWS Glue. Over time, high-frequency schema updates 
pushed the number of `TABLE_VERSION` resources in Glue beyond the *1 million* 
limit. This caused jobs to fail in ways that were initially difficult to trace.
+
+One such failure manifested as the following error:
+
+```
+ERROR Client: Application diagnostics message: User class threw exception:
+java.lang.NoSuchMethodError: 'org.apache.hudi.exception.HoodieException 
+org.apache.hudi.sync.common.util.SyncUtilHelpers.getExceptionFromList(java.util.Collection)'
+```
+
+After significant debugging, the issue was traced to AWS Glue limits. The team 
implemented a multi-step fix:
+
+* Worked with AWS to temporarily raise resource limits.  
+* Developed a Python service to identify and delete outdated table versions, 
removing over 1 million entries.  
+* Added an Airflow job to schedule weekly cleanup tasks.  
+* Improved schema sync logic to trigger only when the schema changed.
+
+### Debezium & TOAST Handling
+
+PostgreSQL CDC ingestion posed unique challenges due to the database’s 
handling of large fields using TOAST (The Oversized-Attribute Storage 
Technique). When fields over 8KB were unchanged, Debezium emitted a placeholder 
value `__debezium_unavailable_value`, making it impossible to determine whether 
the value had changed.
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig5.png"
 alt="challenge" width="1000" align="middle"/>
+
+To address this, Peloton:
+
+* Populated initial data using PostgreSQL snapshots.  
+* Implemented self-joins between incoming CDC records and existing Hudi 
records to fill in missing values.  
+* Separated inserts, updates, and deletes within Spark batch processing.  
+* Used the `ts` field as the precombine key to ensure only the latest record 
state was retained.
+
+A reconciliation pipeline was also developed to heal data inconsistencies 
caused by multiple operations on the same key within a batch (e.g., 
create-delete-create).
+
+### Data Validation and Quality Enforcement
+
+Data quality was critical to ensure trust in the newly established data lake. 
The team developed several internal libraries and checks:
+
+* A Crypto Shredding Library to encrypt `user_id` and other PII fields before 
storage.  
+* A Data Validation Framework that compared records in the lake against 
snapshot data.  
+* A Data Quality Library that enforced column-level thresholds. These checks 
integrated with DataHub and were tied to Airflow sensors to halt downstream 
jobs on failures.
+
+### DynamoDB Ingestion and Schema Challenges
+
+Some Peloton services relied on DynamoDB for operational workloads (NoSQL). To 
ingest these datasets into the lake, the team used DynamoDB Streams and a Kafka 
Connector, allowing reuse of the existing Kafka-based Hudi ingestion path.
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig6.png"
 alt="challenge" width="1000" align="middle"/>
+
+However, the NoSQL nature of DynamoDB introduced schema management challenges. 
Two strategies were evaluated:
+
+1. Stakeholder-defined schemas, using SUPER-type fields.  
+2. Dynamic schema inference, where incoming JSON records were parsed, and the 
evolving schema was inferred and reconciled.
+
+The team opted for dynamic inference despite increased processing time, as it 
enabled better support for exploratory workloads. Daily snapshots and 
reconciliation steps helped clean up inconsistent schema states.
+
+### Reducing Operational Costs
+
+<img 
src="/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig7.png"
 alt="challenge" width="1000" align="middle"/>
+
+As the system matured, cost optimization became a priority. The team used 
[Ganglia](https://github.com/ganglia/) to analyze job profiles and identify 
areas for improvement:
+
+* EMR resources were gradually right-sized based on CPU and memory usage.  
+* Conditional Hive syncing was introduced to avoid unnecessary sync operations 
during each run.  
+* A Spark-side inefficiency was discovered where archived timelines were 
unnecessarily loaded, causing jobs to take 4x longer. Fixing this reduced 
overall latency and compute resource usage.
+
+These operational refinements significantly reduced idle times and improved 
the cost-efficiency of the platform.
+
+## Gains from Hudi Adoption
+
+Peloton's transition to Apache Hudi led to measurable performance, 
operational, and cost-related improvements across its modern data platform.
+
+Peloton's transition to Apache Hudi yielded several measurable improvements:
+
+* Ingestion frequency increased from once daily to every 10 minutes.  
+* Reduced snapshot job durations from an hour to under 15 minutes.  
+* Cost savings by eliminating read replicas and optimizing EMR cluster usage.  
+* Time travel support enabled retrospective analysis and model re-training.  
+* Improved compliance posture through structured deletes and encrypted PII.
+
+The modernization laid the groundwork for future evolution, including 
real-time streaming ingestion using Apache Flink and continued improvements in 
data freshness, latency, and governance.
+
+This blog is based on Peloton’s presentation at the Apache Hudi Community 
Sync. If you are interested in watching the recorded version of the video, you 
can find it [here](https://youtu.be/-Pyid5K9dyU?feature=shared).
+
+---
\ No newline at end of file
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig1.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig1.png
new file mode 100644
index 000000000000..52cdc65d0c20
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig1.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig2.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig2.png
new file mode 100644
index 000000000000..a9e988584791
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig2.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig3.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig3.png
new file mode 100644
index 000000000000..5d1c66831142
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig3.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig4.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig4.png
new file mode 100644
index 000000000000..0b57e67f8cab
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig4.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig5.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig5.png
new file mode 100644
index 000000000000..82467b811c25
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig5.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig6.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig6.png
new file mode 100644
index 000000000000..25dd80f901c3
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig6.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig7.png
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig7.png
new file mode 100644
index 000000000000..82cd5433f763
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/pel_fig7.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/peloton-1200x600.jpg
 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/peloton-1200x600.jpg
new file mode 100644
index 000000000000..916b787bce0d
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-07-15-modernizing-datainfra-peloton-hudi/peloton-1200x600.jpg
 differ

(hudi) branch asf-site updated: [BLOG] Modernizing Data Infrastructure at Peloton Using Apache Hudi (#13561)

Reply via email to