Hi Zoi and the Wayang community,

My name is Sai Mudragada, and I am writing to introduce myself as a prospective GSoC 2026 contributor for Apache Wayang. I graduated in December 2025 with a focus on Data Science and ML Engineering, and I am excited about the "Support for a DataFrame API" project idea.
I have spent time studying the Wayang codebase: the operator model, the plan compilation pipeline, the Python bindings, and the existing platform implementations. I have also reviewed the mailing list discussions around the DataFrame API to understand the community's direction.

Why this project fits my background:
- I have built production-grade data pipelines and ML systems in Python, including end-to-end pipelines using pandas, scikit-learn, PyTorch, FastAPI, and Docker.
- My MustangsAI project is a full RAG pipeline with document ingestion and vector retrieval, which is architecturally similar to what the DataFrame API needs: a clean translation layer from high-level user intent to an efficient execution plan.
- My RevenueBoost-ML project demonstrates pipeline engineering with A/B testing, Docker, and CI/CD, reflecting the production mindset needed to build a reliable, well-tested DataFrame API.
- I am deeply familiar with the DataFrame paradigm as a user (pandas, Spark-style APIs), which means I can design an API that feels natural to data professionals.

My proposed approach: I plan to implement the DataFrame API as a new module (wayang-dataframe) that sits above the existing operator layer, with a three-stage design:
1. a user-facing Python API with lazy evaluation,
2. a logical plan compiler that translates DataFrame operations to Wayang operators, and
3. delegation to Wayang's existing optimizer and executor for cross-backend routing.

Core operations I plan to cover: select, filter, withColumn, drop, distinct, limit, groupBy, agg, join, and sort, plus I/O methods (read_csv, read_parquet) and actions (count, collect, show, toPandas). I also plan to include cross-backend integration tests that validate the same program on Java Streams and Apache Spark.

I have attached a full proposal document with a detailed technical design, a week-by-week timeline, and deliverables. I would love to get the community's feedback on the architectural direction before I finalize and submit on the GSoC portal.
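To make the three-stage design concrete, here is a minimal, self-contained sketch of how stages 1 and 2 could fit together: transformations only append nodes to an immutable logical plan, and an action triggers compilation. All names here (the DataFrame class shape, compile_to_wayang, the operator names "FilterOperator" and "MapOperator") are illustrative assumptions for discussion, not existing Wayang or pywy APIs.

```python
# Sketch only: hypothetical lazy DataFrame layer, not actual Wayang code.
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalOp:
    """One node of the logical plan: an operation name plus its arguments."""
    name: str
    args: tuple = ()

class DataFrame:
    """User-facing handle; transformations are lazy and return a new handle."""
    def __init__(self, plan=()):
        self._plan = tuple(plan)  # immutable so handles can be shared safely

    def _append(self, name, *args):
        return DataFrame(self._plan + (LogicalOp(name, args),))

    def filter(self, predicate):
        return self._append("filter", predicate)

    def select(self, *columns):
        return self._append("select", columns)

def compile_to_wayang(plan):
    """Stage 2 sketch: map each logical op to a (hypothetical) Wayang operator."""
    mapping = {"filter": "FilterOperator", "select": "MapOperator"}
    return [mapping[op.name] for op in plan]

# Building a query records operations without executing anything:
df = DataFrame().filter("age > 30").select("name")
print([op.name for op in df._plan])   # ['filter', 'select']
print(compile_to_wayang(df._plan))    # ['FilterOperator', 'MapOperator']
```

In stage 3, the compiled operator list would be handed to Wayang's existing optimizer and executor, which is what keeps backend selection (Java Streams vs. Spark) out of the DataFrame layer entirely.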
A few questions I would appreciate guidance on:
1. Is there any existing partial work or prior discussion on the DataFrame API that I should be aware of beyond the Jira ticket?
2. Is the 350-hour (large) scope preferred, or would a focused 175-hour scope covering the core operations be more aligned with what the community wants?
3. Are there specific backends (e.g., Spark vs. Java Streams) you would like prioritized for cross-backend validation?

Thank you for your time. I look forward to engaging with the community!

Best regards,
Sai Mudragada
[email protected]
github.com/Saimudragada
saimudragadaportfolio.vercel.app
SAI_Apache_Wayang_GSoC2026_Proposal.docx
