Matt, some suggestions. If you're interested in the machine-learning layer, perhaps you could look into helping to harmonize our (Adatao) dataframe representation with MLlib's, and with base RDDs for that matter. It requires someone to spend some dedicated time examining the trade-offs between generality and performance. It's something our groups have talked about doing but haven't been able to invest the resources in.
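To make the harmonization idea concrete, here is a minimal pure-Scala sketch (no Spark dependency) of what a shared row-oriented contract might look like, with one eager in-memory implementation and one lazy implementation that defers work the way an RDD-backed dataframe would. All names here (MiniDataFrame, LocalDF, LazyDF) are invented for illustration; this is not either project's real API, just one shape the common interface could take:

```scala
// Hypothetical common contract that both an Adatao-style dataframe and an
// RDD-backed one could implement: a schema plus a way to materialize rows.
trait MiniDataFrame {
  def columns: Seq[String]
  def collectRows(): Seq[Map[String, Any]]
  // A generic row transformation, analogous to RDD.map.
  def mapRows(f: Map[String, Any] => Map[String, Any]): MiniDataFrame
}

// Eager, in-memory implementation: rows are already materialized.
case class LocalDF(columns: Seq[String], data: Seq[Map[String, Any]])
    extends MiniDataFrame {
  def collectRows(): Seq[Map[String, Any]] = data
  def mapRows(f: Map[String, Any] => Map[String, Any]): MiniDataFrame =
    LocalDF(columns, data.map(f))
}

// Lazy implementation: transformations are deferred until collectRows(),
// mimicking how an RDD-backed dataframe delays computation.
case class LazyDF(columns: Seq[String], compute: () => Seq[Map[String, Any]])
    extends MiniDataFrame {
  def collectRows(): Seq[Map[String, Any]] = compute()
  def mapRows(f: Map[String, Any] => Map[String, Any]): MiniDataFrame =
    LazyDF(columns, () => compute().map(f))
}
```

The interesting trade-off shows up in exactly this split: code written against the common trait stays general, but the lazy path pays for genericity (untyped rows, deferred closures) where a specialized representation could be much faster.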
Separately, neural nets/deep learning is an area of emerging interest to look into with Spark. It may drive some alternate optimization patterns for Spark, e.g., sub-cluster communication. If interested, I can connect you to some deep learning folks at UoT (not too far from you) and Google. Matei may also have some interest in this.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Tue, Dec 17, 2013 at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:

> Hi everyone,
>
> During my most recent internship, I worked extensively with Apache Spark,
> integrating it into a company's data analytics platform. I've now become
> interested in contributing to Apache Spark.
>
> I'm returning to undergraduate studies in January, and there is an academic
> course which is simply a standalone software engineering project. I was
> thinking that some contribution to Apache Spark would satisfy my curiosity,
> help continue supporting the company I interned at, and give me the academic
> credits required to graduate, all at the same time. It seems like too good
> an opportunity to pass up.
>
> With that in mind, I have the following questions:
>
> 1. At this point, is there any self-contained project that I could work
>    on within Spark? Ideally, I would work on it independently, in about a
>    three-month time frame. This time also needs to accommodate ramping up
>    on the Spark codebase and adjusting to the Scala programming language
>    and its paradigms. The company I worked at primarily used the Java APIs.
>    The output needs to be a technical report describing the project
>    requirements and the design process I took to engineer the solution for
>    the requirements. In particular, it cannot just be a series of haphazard
>    patches.
> 2. How can I get started with contributing to Spark?
> 3. Is there a high-level UML diagram or some other design specification
>    for the Spark architecture?
>
> Thanks! I hope to be of some help =)
>
> -Matt Cheah