Personal Information
* Name: Siddharth Shehria
* GitHub ID: sidshehria <https://github.com/sidshehria>
* Email: [email protected]
* LinkedIn: linkedin.com/in/sidshehria <https://linkedin.com/in/sidshehria>
* Time Zone & Available Hours Per Week: IST (Indian Standard Time), 25-30 hours
  per week
Project Proposal
Title
Improving Python Bindings in Apache DataFusion
Synopsis
Apache DataFusion offers Python bindings that enable users to build data
systems using Python. However, these bindings are relatively low-level and do
not offer the end-user-focused APIs that libraries like Pandas and Polars
provide. This project aims to enhance DataFusion’s Python bindings by adding
high-level abstractions and broader API coverage, improving both usability and
performance and making DataFusion more accessible to the data science and
analytics community.
Benefits to the Community
* Improves the Python API usability of Apache DataFusion, making it more
  accessible for data engineers and analysts.
* Bridges the gap between low-level bindings and high-level usability found in
  Pandas and Polars.
* Expands DataFusion's reach by making it easier to integrate with data science
  workflows.
* Enhances performance by optimizing APIs and query execution, making DataFusion
  a competitive choice for analytics applications.
* Aligns DataFusion with modern data processing libraries, encouraging adoption
  within the open-source and industry ecosystem.
Deliverables & Milestones
Timeline | Deliverable
Community Bonding (May-June) | Engage with mentors, understand the existing Python bindings, and finalize the project roadmap.
Phase 1 (June-July) | Implement missing high-level APIs, improve type annotations, and ensure feature parity with Pandas and Polars where applicable.
Phase 2 (July-August) | Optimize performance, improve documentation, and write comprehensive unit tests.
Final Evaluation (August-September) | Deliver production-ready bindings, complete tutorials, and submit final reports.
Technical Details
* Programming Languages: Python, Rust
* Libraries & Tools: DataFusion, PyO3, Pandas, Polars
* Key Focus Areas:
  * Exposing additional APIs for data manipulation and transformation.
  * Improving dataframe interoperability with Pandas and Polars.
  * Optimizing the FFI (Foreign Function Interface) layer for better performance.
  * Enhancing documentation and examples for Python users.
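To make the idea of a high-level layer over low-level bindings concrete, here is a minimal pure-Python sketch. Every name in it (`Expr`, `LowLevelFrame`, `Frame`, `from_records`, `to_records`) is a hypothetical stand-in invented for illustration; none of it is the actual DataFusion API, and the real design would be settled with the mentors. It only shows the general shape: a thin facade that translates Pandas-style indexing and `assign` calls into explicit expression calls on a lower-level object.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


class Expr:
    """Hypothetical stand-in for a low-level expression node."""

    def __init__(self, fn: Callable[[Row], Any]) -> None:
        self._fn = fn

    def evaluate(self, row: Row) -> Any:
        return self._fn(row)

    def __gt__(self, other: Any) -> "Expr":
        rhs = as_expr(other)
        return Expr(lambda row: self.evaluate(row) > rhs.evaluate(row))

    def __mul__(self, other: Any) -> "Expr":
        rhs = as_expr(other)
        return Expr(lambda row: self.evaluate(row) * rhs.evaluate(row))


def col(name: str) -> Expr:
    """Expression that reads one column from a row."""
    return Expr(lambda row: row[name])


def lit(value: Any) -> Expr:
    """Expression that always yields a constant."""
    return Expr(lambda row: value)


def as_expr(value: Any) -> Expr:
    """Wrap plain Python values so callers need not write lit() themselves."""
    return value if isinstance(value, Expr) else lit(value)


class LowLevelFrame:
    """Hypothetical low-level frame: every operation takes an explicit Expr."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)

    def filter(self, predicate: Expr) -> "LowLevelFrame":
        return LowLevelFrame(r for r in self.rows if predicate.evaluate(r))

    def with_column(self, name: str, expr: Expr) -> "LowLevelFrame":
        return LowLevelFrame({**r, name: expr.evaluate(r)} for r in self.rows)


class Frame:
    """Hypothetical high-level facade with Pandas-style ergonomics."""

    def __init__(self, inner: LowLevelFrame) -> None:
        self._inner = inner

    @classmethod
    def from_records(cls, rows: Iterable[Row]) -> "Frame":
        return cls(LowLevelFrame(rows))

    def __getitem__(self, key: Any) -> Any:
        if isinstance(key, str):   # frame["km"] -> column expression
            return col(key)
        if isinstance(key, Expr):  # frame[predicate] -> filtered frame
            return Frame(self._inner.filter(key))
        raise TypeError(f"unsupported key type: {type(key).__name__}")

    def assign(self, **columns: Any) -> "Frame":
        inner = self._inner
        for name, value in columns.items():
            inner = inner.with_column(name, as_expr(value))
        return Frame(inner)

    def to_records(self) -> List[Row]:
        return [dict(r) for r in self._inner.rows]


trips = Frame.from_records([
    {"city": "Pune", "km": 12.0},
    {"city": "Delhi", "km": 3.5},
    {"city": "Pune", "km": 7.25},
])
# Pandas-style filtering and column creation, delegated to the low-level calls.
long_trips = trips[trips["km"] > 5].assign(metres=trips["km"] * 1000)
result = long_trips.to_records()
```

The facade holds no data-processing logic of its own; it only rewrites convenient syntax into the explicit calls. In the real bindings that would keep the Rust/PyO3 surface small while the ergonomic layer evolves in Python.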
Related Work & References
* Apache DataFusion <https://arrow.apache.org/datafusion/>
* PyO3 - Rust bindings for Python <https://pyo3.rs/>
* Pandas API Reference <https://pandas.pydata.org/docs/reference/>
* Polars API Reference <https://pola.rs/docs/reference/>
Personal Experience
Relevant Skills & Background
* Languages: Python, Rust, SQL, JavaScript, C++
* Data Analysis: Pandas, NumPy, Scikit-Learn, Power BI, Tableau
* Backend Development: FastAPI, Flask, Node.js, PostgreSQL
* Cloud & DevOps: Docker, AWS, Google Cloud Platform
* Open Source Experience:
  * Contributed to TwitterOSS by optimizing key data processing metrics.
  * Developed REST APIs and data pipelines for Unified Mentor.
  * Built data analysis dashboards at Vizipa using Power BI and SQL.
Past Open-Source Contributions
* TwitterOSS: Implemented Python-based data pipelines, reducing execution time by
  30%.
  Link: https://github.com/twitter/cloudhopper-commons/pull/42
Learning Plan
* Deepen understanding of DataFusion’s Rust internals.
* Study PyO3 for improving Python-Rust interoperability.
* Collaborate with the community to identify pain points and improvements.
* Regularly test implementations against real-world datasets.
Mentor & Communication
* Preferred Communication Channels: Slack, GitHub Discussions, Email
* Weekly Progress Updates Plan:
  * Submit weekly reports on GitHub.
  * Engage in mentor check-ins for feedback and improvements.
  * Share learnings and challenges with the open-source community.
Additional Information
I am deeply passionate about data engineering and analytics, and I believe this
project will allow me to contribute meaningfully to Apache DataFusion while
honing my expertise in Python-Rust interoperability. My previous experience in
API development, data pipelines, and open-source contributions equips me to
tackle this project successfully. I look forward to working with the community
to make DataFusion’s Python bindings more powerful and user-friendly!