Hi Zoi and the Wayang community,

My name is Sai Mudragada, and I am writing to introduce myself as a
prospective GSoC 2026 contributor for Apache Wayang. I graduated in
December 2025 with a focus on Data Science and ML Engineering, and I am
excited about the "Support for a DataFrame API" project idea.

I have spent time studying the Wayang codebase — the operator model, plan
compilation pipeline, Python bindings, and existing platform
implementations. I have also reviewed the mailing list discussions around
the DataFrame API to understand the community's direction.

Why this project fits my background:

- I have built production-grade data pipelines and ML systems in Python
end to end, using pandas, scikit-learn, PyTorch, FastAPI, and Docker.
- My MustangsAI project is a full RAG pipeline with document ingestion and
vector retrieval — architecturally similar to what the DataFrame API needs:
a clean translation layer from high-level user intent to an efficient
execution plan.
- My RevenueBoost-ML project demonstrates pipeline engineering with A/B
testing, Docker, and CI/CD — the production mindset needed to build a
reliable, well-tested DataFrame API.
- I am deeply familiar with the DataFrame paradigm as a user (pandas,
Spark-style APIs), which means I can design an API that feels natural to
data professionals.

My proposed approach:

I plan to implement the DataFrame API as a new module (wayang-dataframe)
that sits above the existing operator layer, with a three-stage design: (1)
a user-facing Python API with lazy evaluation, (2) a logical plan compiler
that translates DataFrame operations to Wayang operators, and (3)
delegation to Wayang's existing optimizer and executor for cross-backend
routing.
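
To make the three stages concrete, here is a minimal lazy-evaluation
sketch. Every class, method, and function name below is a placeholder I
chose for illustration — none of it is Wayang's actual API:

```python
# Illustrative sketch of stages (1)-(3); names are placeholders, not Wayang's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanNode:
    """One node in the lazy logical plan (the input to stage 2)."""
    op: str                        # e.g. "read_csv", "filter", "select"
    args: tuple = ()
    child: "PlanNode | None" = None


class DataFrame:
    """Stage 1: user-facing API. Each call only appends a plan node;
    nothing executes until an action is invoked."""

    def __init__(self, plan: PlanNode):
        self._plan = plan

    def filter(self, predicate: str) -> "DataFrame":
        return DataFrame(PlanNode("filter", (predicate,), self._plan))

    def select(self, *cols: str) -> "DataFrame":
        return DataFrame(PlanNode("select", cols, self._plan))

    def explain(self) -> list:
        """Walk the plan bottom-up. A real compiler would map each node
        to a Wayang operator here (stage 2) and then hand the operator
        plan to Wayang's optimizer and executor (stage 3)."""
        chain, node = [], self._plan
        while node is not None:
            chain.append(node.op)
            node = node.child
        return list(reversed(chain))


def read_csv(path: str) -> DataFrame:
    """Hypothetical source operator that roots a new plan."""
    return DataFrame(PlanNode("read_csv", (path,)))
```

For example, `read_csv("orders.csv").filter("amount > 10").select("id")`
builds a three-node plan without touching any data; `explain()` (or a
real action like `collect()`) is where compilation and execution would
kick in.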

Core operations I plan to cover: select, filter, withColumn, drop,
distinct, limit, groupBy, agg, join, sort — plus I/O methods (read_csv,
read_parquet) and actions (count, collect, show, toPandas). I also plan to
include cross-backend integration tests validating the same program on Java
Streams and Apache Spark.
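
The cross-backend tests would follow the pattern below: one logical
program, two execution strategies, and an equality assertion. The two
backend functions here are toy stand-ins I wrote for illustration (one
materializing like Java Streams, one sort-and-stream), not Wayang's
platform implementations:

```python
# Illustrative cross-backend check: a groupBy/count over filtered rows,
# executed by two toy strategies that stand in for different platforms.
from collections import Counter
from itertools import groupby

ROWS = [("a", 5), ("b", 12), ("a", 20), ("b", 3), ("a", 15)]


def eager_backend(rows):
    """Stand-in for a materializing backend: build a list, then count."""
    kept = [key for key, value in rows if value > 10]
    return dict(Counter(kept))


def streaming_backend(rows):
    """Stand-in for a streaming backend: sort, then group lazily."""
    kept = sorted(key for key, value in rows if value > 10)
    return {key: sum(1 for _ in group) for key, group in groupby(kept)}


# The test asserts both strategies agree on the same logical program.
assert eager_backend(ROWS) == streaming_backend(ROWS) == {"a": 2, "b": 1}
```

In the real suite, the same DataFrame program would be run once per
configured platform and the collected results compared, so any semantic
divergence between backends fails the test.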

I have attached a full proposal document with detailed technical design,
week-by-week timeline, and deliverables. I would love to get the
community's feedback on the architectural direction before I finalize and
submit on the GSoC portal.

A few questions I would appreciate guidance on:
1. Is there any existing partial work or prior discussion on the DataFrame
API I should be aware of beyond the Jira ticket?
2. Is the 350-hour (large) scope preferred, or would a focused 175-hour
scope on core operations be more aligned with what the community wants?
3. Are there specific backends (e.g., Spark vs Java Streams) you'd like
prioritized for cross-backend validation?

Thank you for your time. I look forward to engaging with the community!

Best regards,
Sai Mudragada
[email protected]
github.com/Saimudragada
saimudragadaportfolio.vercel.app

Attachment: SAI_Apache_Wayang_GSoC2026_Proposal.docx