Hi Matt,

If you want to get started looking at Spark, I recommend the following 
resources:

- Our issue tracker at http://spark-project.atlassian.net contains some issues 
marked “Starter” that are good places to jump in. You might be able to take 
one of those and extend it into a bigger project.

- The “contributing to Spark” wiki page covers how to send patches and set up 
development: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 

- This talk gives an intro to Spark internals (video and slides are in the 
comments): http://www.meetup.com/spark-users/events/94101942/

For a longer project, here are some possible ones:

- Create a tool that automatically checks which Scala API methods are missing 
in Python. We had a similar one for Java that was very useful. Even better 
would be to automatically generate wrappers for the missing Scala methods. 
(A rough sketch of the checking step follows this list.)

- Extend the Spark monitoring UI with profiling information: sample the 
workers to report where they’re spending time, or which data structures 
consume the most memory. (See the sampling sketch after this list.)

- Pick and implement a new machine learning algorithm for MLlib.
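
For the Python coverage checker, the core of it might look something like 
the sketch below: use reflection to list the public Scala methods on RDD, 
then diff them against the “def” names scraped from pyspark’s rdd.py. 
PYSPARK_RDD_PY is a hypothetical environment variable pointing at 
python/pyspark/rdd.py in a Spark checkout, and the regex scrape is 
deliberately naive; a real tool would cover more classes and handle 
renamed methods.

    import scala.io.Source

    // Rough sketch: diff public Scala RDD methods against Python defs.
    // PYSPARK_RDD_PY is a hypothetical env var for this example.
    object ApiCoverage {
      def main(args: Array[String]): Unit = {
        // Public Scala methods on RDD, via Java reflection.
        val scalaMethods =
          classOf[org.apache.spark.rdd.RDD[_]].getMethods.map(_.getName).toSet

        // Python method names, scraped naively from rdd.py with a regex.
        val defPattern = """def (\w+)\(""".r
        val pythonMethods = Source.fromFile(sys.env("PYSPARK_RDD_PY"))
          .getLines()
          .flatMap(line => defPattern.findFirstMatchIn(line).map(_.group(1)))
          .toSet

        // Scala methods with no same-named Python counterpart.
        (scalaMethods -- pythonMethods).toSeq.sorted.foreach(println)
      }
    }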
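
And for the profiling idea, the sampling itself is small enough to 
prototype quickly. Here’s an illustrative sketch (not an existing Spark 
API) that snapshots every thread’s stack in a JVM and counts the hottest 
top frames; wiring results like these from the workers into the 
monitoring UI would be the real project.

    import scala.collection.mutable
    import scala.collection.JavaConverters._

    // Rough sketch of a stack-sampling profiler for one JVM.
    // Interval and frame aggregation are illustrative choices.
    object StackSampler {
      def sample(iterations: Int, intervalMs: Long): Seq[(String, Int)] = {
        val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
        for (_ <- 1 to iterations) {
          // Snapshot every live thread's stack and count its top frame.
          for ((_, frames) <- Thread.getAllStackTraces.asScala
               if frames.nonEmpty) {
            val top = frames.head
            counts(s"${top.getClassName}.${top.getMethodName}") += 1
          }
          Thread.sleep(intervalMs)
        }
        counts.toSeq.sortBy(-_._2)
      }

      def main(args: Array[String]): Unit = {
        // Print the hottest frames seen over ~5 seconds of sampling.
        sample(iterations = 50, intervalMs = 100).take(20).foreach {
          case (frame, n) => println(f"$n%5d  $frame")
        }
      }
    }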

Matei

On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:

> Hi everyone,
> 
> During my most recent internship, I worked extensively with Apache Spark,
> integrating it into a company's data analytics platform. I've now become
> interested in contributing to Apache Spark.
> 
> I'm returning to undergraduate studies in January and there is an academic
> course which is simply a standalone software engineering project. I was
> thinking that some contribution to Apache Spark would satisfy my curiosity,
> help continue to support the company I interned at, and earn the academic
> credits required to graduate, all at the same time. It seems like too good
> an opportunity to pass up.
> 
> With that in mind, I have the following questions:
> 
>   1. At this point, is there any self-contained project that I could work
>   on within Spark? Ideally, I would work on it independently, in about a
>   three month time frame. This time also needs to accommodate ramping up on
>   the Spark codebase and adjusting to the Scala programming language and
>   paradigms. The company I worked at primarily used the Java APIs. The output
>   needs to be a technical report describing the project requirements and the
>   design process I followed to engineer a solution to them. In particular,
>   it cannot just be a series of haphazard patches.
>   2. How can I get started with contributing to Spark?
>   3. Is there a high-level UML or some other design specification for the
>   Spark architecture?
> 
> Thanks! I hope to be of some help =)
> 
> -Matt Cheah
