Hi Matt,

If you want to get started looking at Spark, I recommend the following resources:
- Our issue tracker at http://spark-project.atlassian.net contains some issues marked “Starter” that are good places to jump in. You might be able to take one of those and extend it into a bigger project.

- The “Contributing to Spark” wiki page covers how to send patches and set up development: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

- This talk gives an intro to Spark internals (the video and slides are in the comments): http://www.meetup.com/spark-users/events/94101942/

For a longer project, here are a few possibilities:

- Create a tool that automatically checks which Scala API methods are missing in Python. We had a similar one for Java that was very useful. Even better would be a tool that automatically generates the wrappers for the Scala methods. (A rough sketch of the checking side is at the end of this message.)

- Extend the Spark monitoring UI with profiling information (sample the workers and report where they spend their time, or which data structures consume the most memory).

- Pick and implement a new machine learning algorithm for MLlib.

Matei

On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:

> Hi everyone,
>
> During my most recent internship, I worked extensively with Apache Spark,
> integrating it into a company's data analytics platform. I've now become
> interested in contributing to Apache Spark.
>
> I'm returning to undergraduate studies in January, and there is an academic
> course that consists simply of a standalone software engineering project. I
> was thinking that a contribution to Apache Spark would satisfy my curiosity,
> help continue supporting the company I interned at, and earn me the academic
> credits required to graduate, all at the same time. It seems like too good an
> opportunity to pass up.
>
> With that in mind, I have the following questions:
>
> 1. At this point, is there any self-contained project that I could work on
> within Spark? Ideally, I would work on it independently, in about a
> three-month time frame. This time also needs to accommodate ramping up on the
> Spark codebase and adjusting to the Scala programming language and its
> paradigms. (The company I worked at primarily used the Java APIs.) The output
> needs to be a technical report describing the project requirements and the
> design process I followed to engineer a solution for those requirements. In
> particular, it cannot just be a series of haphazard patches.
> 2. How can I get started with contributing to Spark?
> 3. Is there a high-level UML diagram or some other design specification for
> the Spark architecture?
>
> Thanks! I hope to be of some help =)
>
> -Matt Cheah
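
P.S. For the API-coverage tool above, a minimal sketch of the checking side
might look like the following. It assumes PySpark is importable, and the
scala_methods set is a hypothetical, illustrative stand-in for a method list
you would actually extract from the Scala RDD class (for example, by parsing
javap output for org.apache.spark.rdd.RDD, or via reflection).

    import inspect

    from pyspark.rdd import RDD

    # Hypothetical stand-in: a real tool would extract this set from the
    # Scala RDD class rather than hard-coding it.
    scala_methods = {
        "map", "flatMap", "filter", "mapPartitions",
        "zipPartitions", "treeAggregate", "countApproxDistinct",
    }

    # Public, callable members of the Python RDD class.
    python_methods = {
        name for name, member in inspect.getmembers(RDD)
        if callable(member) and not name.startswith("_")
    }

    # Anything defined in Scala but absent from Python is a coverage gap.
    for name in sorted(scala_methods - python_methods):
        print("Missing from PySpark: " + name)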