This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 44e654e40cdc docs(blog): add community sync blog - zupee (#17681)
44e654e40cdc is described below
commit 44e654e40cdc9ebd60b0180ffb05f2b227164eb5
Author: Shiyan Xu <[email protected]>
AuthorDate: Tue Dec 23 01:24:27 2025 -0600
docs(blog): add community sync blog - zupee (#17681)
---
...how-zupee-cut-s3-costs-60-percent-with-hudi.mdx | 170 +++++++++++++++++++++
.../image1.png | Bin 0 -> 1799237 bytes
.../image2.png | Bin 0 -> 443776 bytes
.../image3.png | Bin 0 -> 2345666 bytes
.../og.png | Bin 0 -> 1041052 bytes
5 files changed, 170 insertions(+)
diff --git
a/website/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi.mdx
b/website/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi.mdx
new file mode 100644
index 000000000000..65d3d1e32c9e
--- /dev/null
+++ b/website/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi.mdx
@@ -0,0 +1,170 @@
+---
+title: "How Zupee Cut S3 Costs by 60% with Apache Hudi"
+excerpt: "How Zupee, India's largest skill-based Ludo platform, reduced S3
costs by 60% and achieved 15-minute ingestion SLAs using Apache Hudi's Metadata
Table and Hudi Streamer."
+author: The Hudi Community
+category: blog
+image:
/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/og.png
+tags:
+ - hudi
+ - lakehouse
+ - case-study
+ - zupee
+---
+
+---
+
+_This post summarizes Zupee's talk from the Apache Hudi community sync. Watch
the recording on [YouTube](https://www.youtube.com/watch?v=Mpj8hnaZbe0)._
+
+---
+
+
+
+[Zupee](https://www.zupee.com/) is India's largest skill-based Ludo platform
([Ludo](https://en.wikipedia.org/wiki/Ludo) is a classic Indian board game),
founded in 2018 with a vision of bringing moments of joy to users through
meaningful entertainment. The company was the first to introduce a skill
element to culturally relevant games like Ludo, reviving the joy of traditional
Indian gaming.
+
+At Zupee, data plays a crucial role in everything they do—from understanding
user behavior to optimizing services. It sits at the core of their
decision-making process. In this community sync, Amarjeet Singh, Senior Data
Engineer at Zupee, shared how his team built a scalable data platform using
Apache Hudi and the significant performance gains they achieved.
+
+## Data Platform Architecture
+
+
+
+Zupee's data platform architecture is designed to handle complex data needs
efficiently. It consists of several layers working together:
+
+### Three-Tiered Data Lake
+
+The data lake is structured into three zones:
+
+- **Landing Zone**: Where raw data first arrives—like a receiving dock where
all incoming data is stored in its original format.
+- **Exploration Zone**: Data is cleaned and prepared for analysts and data
scientists to explore and derive insights.
+- **Analytical Zone**: Processed data optimized for analytical queries. This
layer stores OLAP tables, facts and dimensions, or denormalized wide tables.
+
+### Metastore and API Layer
+
+This layer acts as the brain of the data platform, managing metadata and
providing APIs for data access and integration.
+
+### Orchestration and Framework Layer
+
+The orchestration layer includes tools like Apache Airflow for scheduling and
managing workflows, along with in-house tools and frameworks for ML operations,
data ingestion, and data computation—including [Hudi
Streamer](https://hudi.apache.org/docs/hoodie_deltastreamer/) for data
ingestion.
+
+### Compute Layer
+
+The compute layer includes Apache Spark for large-scale data processing, and
Amazon Athena and Trino for querying.
+
+### Real-Time Serving Layer
+
+For real-time data needs, Zupee uses Apache Flink as the streaming framework
behind their feature store and real-time model predictions.
+
+The serving layer includes real-time dashboards powered by Flink, analytical
dashboards powered by Athena and Trino, and Jupyter notebooks for ad-hoc
analysis by data scientists and machine learning teams.
+
+## Workflow-Based Data Ingestion
+
+
+
+Zupee designed a data ingestion pipeline that provides smooth integration and
scalability. At the heart of their approach is a centralized configuration
system that gives them granular control over every aspect of their jobs.
+
+### Centralized Configuration
+
+The team uses YAML files to centrally manage job details and tenant-level
settings. This promotes consistency and makes updates easy. The centralized
system is hosted on Amazon EMR, ensuring uniformity across all ingestion jobs.
+
+### Multi-Tenant Pipeline
+
+Zupee runs a multi-tenant setup of generic pipelines. They can switch between
different versions of Spark, Hudi Streamer, and Scala—all controlled at the
tenant level.
+
+### Automated Spark Command Generation
+
+The system automatically generates Spark commands based on YAML
configurations. This significantly reduces manual intervention, minimizes
errors, and accelerates the development process.
+
+### The Ingestion Flow
+
+Here's how the workflow operates:
+
+1. A generic job reads a YAML file containing specific job information and
tenant-level configurations.
+2. A single trigger starts each pipeline.
+3. The generic job creates Spark configurations and generates a `spark-submit`
command with Hudi Streamer settings.
+4. The command is submitted using Livy or EMR steps.
+5. Throughout the process, the team tracks data lineage for monitoring and
debugging.
+
+This workflow-based approach streamlines data ingestion, making it both
scalable and reliable for handling large volumes of data.
+
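The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Zupee's framework: the `JOB` dictionary stands in for a parsed YAML file, and every key name, bucket path, and table name in it is an assumption. The Hudi Streamer flags are the ones the utility documents; the main class shown is the pre-0.14 `HoodieDeltaStreamer` name, which matches the 0.12.x JAR used in the example.

```python
import shlex

# Entry point for Hudi 0.12.x; from 0.14 the class is
# org.apache.hudi.utilities.streamer.HoodieStreamer.
MAIN_CLASS = "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer"

# Hypothetical job definition standing in for a parsed YAML file.
# All keys, paths, and names below are placeholders for illustration.
JOB = {
    "tenant": {
        "spark_submit": "spark-submit",
        "hudi_utilities_jar": "s3://example-bucket/jars/hudi-utilities-bundle_2.12-0.12.3.jar",
        "spark_conf": {
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        },
    },
    "job": {
        "name": "orders_ingest",
        "table_type": "MERGE_ON_READ",
        "source_class": "org.apache.hudi.utilities.sources.JsonKafkaSource",
        "ordering_field": "updated_at",
        "target_base_path": "s3://example-bucket/lake/orders",
        "target_table": "orders",
        "props_file": "s3://example-bucket/conf/orders.properties",
        "continuous": True,
        "hoodie_conf": {"hoodie.metadata.enable": "true"},
    },
}


def build_spark_submit(config):
    """Turn tenant-level and job-level settings into a spark-submit argv list."""
    tenant, job = config["tenant"], config["job"]
    cmd = [tenant["spark_submit"], "--name", job["name"], "--class", MAIN_CLASS]
    # Spark options come before the application JAR.
    for key, value in sorted(tenant["spark_conf"].items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(tenant["hudi_utilities_jar"])
    # Everything after the JAR is passed to Hudi Streamer itself.
    cmd += [
        "--table-type", job["table_type"],
        "--source-class", job["source_class"],
        "--source-ordering-field", job["ordering_field"],
        "--target-base-path", job["target_base_path"],
        "--target-table", job["target_table"],
        "--props", job["props_file"],
    ]
    for key, value in sorted(job["hoodie_conf"].items()):
        cmd += ["--hoodie-conf", f"{key}={value}"]
    if job.get("continuous"):
        cmd.append("--continuous")
    return cmd


command = build_spark_submit(JOB)
print(shlex.join(command))
```

Returning a list rather than a single string keeps quoting out of the generator; the list can be handed to a Livy batch request or an EMR step definition, or joined for logging as shown.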
+## Real-Time Ingestion with Hudi Streamer
+
+Zupee uses [Hudi
Streamer](https://hudi.apache.org/docs/hoodie_deltastreamer/), a utility built
on checkpoint-based ingestion. It supports various sources, including
file-based sources on S3 or GCS, Kafka, and database sources such as MySQL
(over JDBC) and MongoDB.
+
+### Key Benefits
+
+- **Checkpoint-based consistency**: Ensures the ability to resume from the
last checkpoint in case of interruption.
+- **Easy backfills**: Reprocess historical data efficiently based on
checkpoints, without re-ingesting the entire dataset—saving both time and
resources.
+- **Data catalog syncing**: Automatically syncs metadata to catalogs such as
AWS Glue and Hive Metastore.
+- **Built-in transformations**: Supports SQL-based and flattening
transformations, so transformation logic can be applied during ingestion.
+- **Custom transformations**: Developers can build custom transformation
classes for specific requirements.
+
+### Deep Dive: How Hudi Streamer Works
+
+
+
+Here's how Hudi Streamer processes data internally:
+
+1. **Job Submission**: A Spark job is submitted with Hudi properties
specifying primary keys, checkpointing details, and other configurations.
+
+2. **StreamSync Wrapper**: The job creates a StreamSync wrapper (formerly
DeltaSync) based on the provided properties.
+
+3. **Continuous or Single Run**: StreamSync starts the sync process, which
runs either continuously or as a single batch, depending on the parameters.
+
+4. **Checkpoint Retrieval**: Hudi Streamer checks whether a checkpoint exists
and retrieves the last position from the Hudi commit metadata. For S3-based
ingestion, the position is based on the files' last-modified time; for Kafka
sources, it is the last committed offsets.
+
+5. **Transformation**: If transformations are configured (single or chained),
they are applied to the source data based on the format (JSON, Avro, etc.).
+
+6. **Write and Compaction**: Data is written to the Hudi table. For
Merge-On-Read tables, compaction runs inline or asynchronously to merge log
files with base files.
+
+7. **Metadata Sync**: Finally, if enabled, metadata is synced to ensure
catalog consistency.
+
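The checkpoint-driven loop in steps 4 to 6 can be illustrated with a toy model. The sketch below is not Hudi code: `ToyTable`, `fetch_since`, and `sync_once` are hypothetical names, the "source" is a plain list whose indexes play the role of offsets, and the "table" is an in-memory dictionary. It only shows the idea that the resume position is read from the last commit and written back together with the data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table: records plus a commit timeline."""
    records: dict = field(default_factory=dict)
    commits: list = field(default_factory=list)

    def last_checkpoint(self):
        # The resume position lives in the newest commit's metadata, if any.
        return self.commits[-1]["checkpoint"] if self.commits else None


def fetch_since(source, checkpoint, limit):
    """Return up to `limit` events after `checkpoint`, plus the new checkpoint."""
    start = 0 if checkpoint is None else checkpoint + 1
    batch = source[start:start + limit]
    new_checkpoint = start + len(batch) - 1 if batch else checkpoint
    return batch, new_checkpoint


def sync_once(source, table, transforms=(), limit=2):
    """One sync round: resume, fetch, transform, upsert, commit."""
    batch, checkpoint = fetch_since(source, table.last_checkpoint(), limit)
    if not batch:
        return 0
    for transform in transforms:  # chained transformations, applied in order
        batch = [transform(record) for record in batch]
    for record in batch:  # upsert keyed on the primary key
        table.records[record["id"]] = record
    # The data and its checkpoint are recorded in the same commit.
    table.commits.append({"checkpoint": checkpoint, "count": len(batch)})
    return len(batch)


def mark_ingested(record):
    """Example transformation: add a flag to every record."""
    return {**record, "ingested": True}


events = [
    {"id": 1, "score": 10},
    {"id": 2, "score": 20},
    {"id": 1, "score": 15},  # later update to id 1
]
table = ToyTable()

# "Continuous" mode is a loop over single runs; here it stops when drained.
while sync_once(events, table, transforms=[mark_ingested]):
    pass
```

Because the checkpoint is stored with the commit, rerunning `sync_once` after an interruption picks up from the last committed position rather than from the start of the source.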
+### Custom Solutions
+
+The Zupee team developed several custom solutions to enhance their Hudi
Streamer pipeline:
+
+- **Dynamic Schema Generator**: A custom class that dynamically generates
schemas for JSON sources, enabling automatic schema creation based on incoming
data.
+- **Post-Process Transformations**: Custom transformations to handle schema
evolution at the source level.
+- **Centralized Configurations**: Hudi configurations managed centrally via
YAML files or `hoodie-config.xml`, simplifying maintenance and updates.
+- **Raw Data Handling**: For raw data ingestion, they discovered that Hudi
Streamer can infer schemas automatically without requiring a schema provider.
+
+## Results: Cost Savings and Performance Gains
+
+The migration to a Hudi-powered platform delivered significant outcomes:
+
+### 60% Reduction in S3 Network Costs
+
+After migrating from Hudi 0.10.x to 0.12.3 and enabling the [Metadata
Table](https://hudi.apache.org/docs/metadata/), Zupee reduced S3 network costs
by over 60%. The metadata table eliminates expensive S3 file listing operations
by maintaining an internal index of all files. The key settings are
`hoodie.metadata.enable=true` for Spark and the table property
`hudi.metadata-listing-enabled=TRUE` for Athena.
+
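On the Spark side, the metadata table setting is an ordinary Hudi option passed alongside the other write or read options. The snippet below is a small sketch that only builds the option dictionaries, so it runs without Spark; the table name, key field, and ordering field are placeholder assumptions, and `with_metadata_table` is a hypothetical helper.

```python
def with_metadata_table(options):
    """Return a copy of `options` with Hudi's metadata table switched on."""
    merged = dict(options)
    merged["hoodie.metadata.enable"] = "true"
    return merged


# Hypothetical table settings; the names and fields are placeholders.
BASE_WRITE_OPTIONS = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

write_options = with_metadata_table(BASE_WRITE_OPTIONS)
read_options = with_metadata_table({})

# With PySpark and a Hudi bundle on the classpath, these would be used as:
#   df.write.format("hudi").options(**write_options).mode("append").save(path)
#   spark.read.format("hudi").options(**read_options).load(path)
```

The same key can also be supplied to Hudi Streamer through `--hoodie-conf` or a properties file.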
+### 15-Minute Ingestion SLA
+
+The team achieved a 15-minute SLA for ingesting 2 to 5 million records using
[Merge-On-Read (MOR)](https://hudi.apache.org/docs/table_types/) tables with
Hudi's indexing for efficient record lookups during upserts.
+
+### 30% Storage Reduction
+
+Switching from Snappy to ZSTD compression resulted in a 30% decrease in data
size. While write times increased slightly, query performance improved
significantly, and both storage and Athena costs decreased.
+
+### Small File Management
+
+[Async compaction](https://hudi.apache.org/docs/compaction/) merges log files
into base files, while Hudi's file sizing, configured via
`hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size`, addresses
the small file problem by filling smaller files toward the target size. The team
also explored Parquet page-level indexing to reduce query costs by allowing
engines to read only relevant pages from files.
+
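One way to read the two file-size settings named above is as a pair of thresholds: files below the small-file limit are candidates to receive more data, and no file should grow past the maximum size. The sketch below is a deliberately simplified illustration of that idea, not Hudi's actual algorithm. `plan_inserts` is a hypothetical function, and the byte values are what I understand the defaults to be; check the configuration reference for your Hudi version.

```python
MB = 1024 * 1024

# Believed defaults; verify against the Hudi configuration reference.
SMALL_FILE_LIMIT = 100 * MB  # hoodie.parquet.small.file.limit
MAX_FILE_SIZE = 120 * MB     # hoodie.parquet.max.file.size


def plan_inserts(existing_sizes, incoming_bytes):
    """Top up files under the small-file limit first, then open new files.

    Returns a list of (file_index, bytes_assigned); file_index is None
    for a newly created file.
    """
    plan = []
    remaining = incoming_bytes
    for index, size in enumerate(existing_sizes):
        if remaining == 0:
            break
        if size < SMALL_FILE_LIMIT:
            take = min(MAX_FILE_SIZE - size, remaining)
            plan.append((index, take))
            remaining -= take
    while remaining > 0:
        take = min(MAX_FILE_SIZE, remaining)
        plan.append((None, take))
        remaining -= take
    return plan


# Three existing files of 20 MB, 110 MB, and 90 MB; 150 MB of new data.
plan = plan_inserts([20 * MB, 110 * MB, 90 * MB], 150 * MB)
```

In this example the 110 MB file is left alone because it is already above the small-file limit, and only the leftover 20 MB ends up in a new file.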
+## Why Hudi Over Other Table Formats?
+
+During the Q&A, Amarjeet explained why Zupee chose Hudi over other table
formats. The team ran POCs with Delta Lake and found that Hudi performed much
better for near-real-time ingestion. Since Zupee primarily works with
real-time data rather than batch workloads, this was the deciding factor.
+
+The native [Hudi
Streamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer)
utility also adds significant value with its built-in checkpoint management,
compaction, and catalog syncing—features that would require additional work
with other ingestion approaches.
+
+## Best Practices for EMR Upgrades
+
+When asked about upgrading Hudi on EMR, Amarjeet shared their approach:
instead of using the EMR-provided JARs, they use open-source JARs. This way,
they can upgrade JARs using their multi-tenant framework, which controls which
JAR goes to which job. They can also use different Spark versions for different
jobs. For example, if they are currently running Hudi 0.12.3 and want to test
version 0.14.1, they simply specify the JAR in their YAML file for that
particular job.
+
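The per-job override described above amounts to layering job settings over tenant defaults. The following sketch assumes a hypothetical layout in which each job may override the Hudi version; the dictionary keys, job names, and bucket path are illustrative, and the file name follows the published `hudi-utilities-bundle_<scala>-<version>.jar` naming.

```python
# Hypothetical tenant-level defaults, standing in for a shared YAML section.
TENANT_DEFAULTS = {
    "hudi_version": "0.12.3",
    "scala_version": "2.12",
    "jar_dir": "s3://example-bucket/jars",
}

# Hypothetical per-job sections; only overrides are listed.
JOBS = {
    "orders_ingest": {},
    "payments_ingest": {"hudi_version": "0.14.1"},  # job under test
}


def resolve_hudi_jar(job_name):
    """Pick the Hudi utilities JAR for a job, falling back to tenant defaults."""
    settings = {**TENANT_DEFAULTS, **JOBS[job_name]}
    file_name = "hudi-utilities-bundle_{scala}-{hudi}.jar".format(
        scala=settings["scala_version"], hudi=settings["hudi_version"]
    )
    return settings["jar_dir"] + "/" + file_name


default_jar = resolve_hudi_jar("orders_ingest")
test_jar = resolve_hudi_jar("payments_ingest")
```

Only the job being tested sees the newer JAR; every other job keeps resolving to the tenant default until its own entry is changed.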
+## Conclusion
+
+Zupee's journey with Apache Hudi demonstrates how a modern data platform can
solve real-world engineering challenges at scale. By moving from a legacy
architecture to a Hudi-powered lakehouse, they built a platform that is robust,
scalable, and cost-effective.
+
+The keys to their success were:
+
+- Enabling Hudi's [Metadata Table](https://hudi.apache.org/docs/metadata/) to
eliminate file listing overhead
+- Using [Merge-On-Read](https://hudi.apache.org/docs/table_types/) tables with
indexing for efficient upserts
+- Building custom transformations for flexible schema evolution
+- Centralizing configuration management for operational simplicity
+
+> "I'm looking forward to more contributing to the community." – Amarjeet
Singh, Senior Data Engineer at Zupee
diff --git
a/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image1.png
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image1.png
new file mode 100644
index 000000000000..3346323dd819
Binary files /dev/null and
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image1.png
differ
diff --git
a/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image2.png
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image2.png
new file mode 100644
index 000000000000..99dfe9118d14
Binary files /dev/null and
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image2.png
differ
diff --git
a/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image3.png
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image3.png
new file mode 100644
index 000000000000..ec01188f990f
Binary files /dev/null and
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/image3.png
differ
diff --git
a/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/og.png
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/og.png
new file mode 100644
index 000000000000..7e0b25c4973c
Binary files /dev/null and
b/website/static/assets/images/blog/2025-12-22-how-zupee-cut-s3-costs-60-percent-with-hudi/og.png
differ