(hudi) branch asf-site updated: docs(blog): add Hudi Upstox blog (#14106)

xushiyan Fri, 17 Oct 2025 19:08:22 -0700

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 9cd6e73a50d8 docs(blog): add Hudi Upstox blog (#14106)
9cd6e73a50d8 is described below

commit 9cd6e73a50d8ce012d273028f4f11fd4741fcab4
Author: Shiyan Xu <[email protected]>
AuthorDate: Thu Oct 16 16:34:21 2025 -0500

    docs(blog): add Hudi Upstox blog (#14106)
    
    Modernizing Upstox's Data Platform with Apache Hudi, dbt, and EMR Serverless
---
 ...form-with-Apache-Hudi-DBT-and-EMR-Serverless.md |  93 +++++++++++++++++++++
 .../fig1.png                                       | Bin 0 -> 246612 bytes
 .../fig2.png                                       | Bin 0 -> 1119798 bytes
 .../fig3.png                                       | Bin 0 -> 361280 bytes
 4 files changed, 93 insertions(+)

diff --git 
a/website/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless.md
 
b/website/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless.md
new file mode 100644
index 000000000000..a77d050d4191
--- /dev/null
+++ 
b/website/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless.md
@@ -0,0 +1,93 @@
+---
+title: "Modernizing Upstox's Data Platform with Apache Hudi, dbt, and EMR 
Serverless"
+excerpt: ""
+author: The Hudi Community
+category: blog
+image: 
/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png
+tags:
+- hudi
+- upstox
+- dbt
+- data lakehouse
+---
+
+## Introduction
+
+In [this community sharing 
session](https://www.youtube.com/watch?v=dAM2zOvnPmw), Manish Gaurav from 
Upstox shared insights into the complexities of managing data ingestion at 
scale. Drawing from the company’s experience as a leading online trading 
platform in India, the discussion highlighted challenges around file-level 
upserts, ensuring atomic operations, and handling small files effectively. 
Upstox shared how they built a modern data platform using Apache Hudi and dbt 
to address thes [...]
+
+Upstox is a leading online trading platform that enables millions of users to 
invest in equities, commodities, derivatives, and currencies. With over 12 
million customers generating 300,000 data requests daily, the company's data 
team is responsible for delivering the real-time insights that power key 
products, including:
+
+* Search functionality  
+* A customer service chatbot (powered by OpenAI)  
+* Personalized portfolio recommendations
+
+![](/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png)
+
+## Data Sources
+
+Upstox ingests 250–300 GB of structured and semi-structured data per day from 
a variety of sources:
+
+* Order and transaction data from exchanges  
+* Microservice telemetry from Cloudflare  
+* Customer support data from platforms like Freshdesk and SquadStack  
+* Behavioral analytics from Mixpanel  
+* Data from operational databases (MongoDB, MySQL, and MS SQL) via AWS DMS
+
+## The Challenges with Initial Data Platform
+
+As Upstox grew, so did the complexity of its data operations. Here are some of 
the early bottlenecks the company faced:
+
+### Data Ingestion Issues
+
+Prior to 2023, Upstox relied on no-code ingestion platforms like Hevo. While 
easy to adopt, these platforms introduced several limitations, including high 
licensing costs and a lack of fine-grained control over ingestion logic. 
File-level upserts required complex joins between incoming CDC (change data 
capture) datasets and target tables. Additionally, a lack of atomicity often 
led to inconsistent data writes, and small-file issues were rampant. To combat 
these problems, the team had to  [...]
+
+### Downstream Consumption Struggles
+
+Analytics queries were primarily served through Amazon Athena, which presented 
several key limitations. For instance, it frequently timed out when querying 
large datasets and often exceeded the maximum number of partitions it could 
handle. Additionally, Athena's lack of support for stored procedures made it 
challenging to manage and reuse complex query logic. Attempts to improve 
performance with bucketing often created more small files, and the lack of 
native support for incremental quer [...]
+
+## The Modern Lakehouse Architecture
+
+![](/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig2.png)
+
+To tackle these problems, Upstox implemented a medallion architecture, 
organizing data into bronze, silver, and gold layers:
+
+* **Bronze (Raw Data):** Data is ingested and stored in its raw format as 
Parquet files.  
+* **Silver (Cleaned and Filtered):** Data is cleaned, filtered, and stored in 
Apache Hudi tables, which are updated incrementally.  
+* **Gold (Business-Ready):** Data is aggregated for specific business use 
cases, modeled with dbt, and stored in Hudi.
+
+### The Solution: A Modern Stack with Hudi, dbt, and EMR Serverless
+
+Upstox re-architected its platform using Apache Hudi as the core data lake 
technology, dbt for transformations, and EMR Serverless for scalable compute. 
Airflow was used to orchestrate the entire workflow. Here's how this new stack 
addressed their challenges:
+
+**Simplified Data Updates:** Hudi provides built-in support for record-level 
upserts with atomic guarantees and snapshot isolation. This helped Upstox 
overcome the challenge of ensuring consistent updates to their fact and 
dimension tables.
+
+**Improved Upsert Performance:** To optimize upsert performance, the team 
leveraged Bloom index, especially for transaction-heavy fact tables. Indexing 
strategies were chosen based on data characteristics to balance latency and 
efficiency.
+
+**Resolved Small-File Issues:** Small files, which are common in streaming 
workloads, were mitigated using clustering jobs supported by Hudi. This process 
was scheduled to run weekly and ensured efficient file sizes and reduced 
storage overhead without manual intervention.
+
+**Enabled Incremental Processing:** Incremental joins allowed Upstox to 
process only new data daily. This enabled timely updates to the aggregated 
tables in the gold layer that power user-facing dashboards—a task that was not 
feasible with traditional Athena queries.
+
+**Managed Metadata Growth:** The accumulation of commit and metadata files in 
the Hudi table’s \`.hoodie/\` directory increased S3 listing costs and slowed 
down operations. Hudi's archival feature helped manage this by archiving older 
commits after a certain threshold, keeping metadata lean and efficient.
+
+**Streamlined Data Modeling:** The team used dbt on EMR Serverless to create 
materialized views over the Hudi datasets. This enabled the creation of 
efficient transformation layers (silver and gold) using familiar SQL workflows 
and managed compute.
+
+**Flexible Data Materialization:** dbt supported a variety of model types, 
including tables, views, and ephemeral models (Common Table Expressions, or 
CTEs). This gave teams the flexibility to optimize for performance, reuse, or 
simplicity, depending on the use case.
+
+**Out-of-the-Box Lineage and Documentation:** dbt helps visualize how data 
flows from one table to another, making it easier to debug and understand 
dependencies. The glossary feature allows teams to document column meanings and 
transformations clearly.
+
+**Enforced Data Quality:** With dbt, specific data quality rules can be added 
to individual tables or pipelines. This adds an extra layer of validation 
beyond the basic checks performed during data ingestion.
+
+### CI/CD and Orchestration
+
+![](/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig3.png)
+
+
+Upstox uses Apache Airflow for orchestration, with dbt pipelines deployed via 
a Git-based CI/CD process. Merging a pull request in GitLab triggers the CI/CD 
pipeline, which automatically builds a new dbt image and publishes the updated 
data catalog. Airflow then runs the corresponding dbt jobs daily or on-demand, 
automating the entire transformation workflow.
+
+### The Impact
+
+The adoption of this modern data stack had a significant impact on Upstox's 
data platform. The company achieved extremely high data availability and 
consistency for critical datasets, reducing SLA breaches for complex joins by 
70%. Furthermore, pipeline costs dropped by 40%, and query performance improved 
drastically thanks to Hudi's clustering and optimized joins.
+
+## Conclusion
+
+By leveraging Apache Hudi, dbt, and EMR Serverless, Upstox built a robust and 
cost-efficient data platform to serve its 12M+ customers, overcoming the 
significant challenges of data ingestion and analytics at scale. This 
transformation resolved critical issues like inconsistent data writes, 
small-file problems, and query timeouts, leading to tangible improvements in 
both performance and efficiency. With a 70% reduction in SLA breaches and a 40% 
drop in pipeline costs, the new architectur [...]
diff --git 
a/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png
 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png
new file mode 100644
index 000000000000..65c97a3517a3
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig2.png
 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig2.png
new file mode 100644
index 000000000000..2c8cc8a9bea1
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig2.png
 differ
diff --git 
a/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig3.png
 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig3.png
new file mode 100644
index 000000000000..519c7a61a556
Binary files /dev/null and 
b/website/static/assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig3.png
 differ

(hudi) branch asf-site updated: docs(blog): add Hudi Upstox blog (#14106)

Reply via email to