Roman, an area that I think (a) would have high impact and (b) is relatively not well covered is performance analysis. I'm sure most teams are doing this internally at their respective companies, but there is no shared code base or shared body of wisdom about what we're finding and improving.
For example, consider the task of loading a table from disk into memory in Shark. We're getting conflicting data about how much of this is CPU-bound vs. I/O-bound. Our effort to track this down should be shareable somehow, and would benefit from others' findings. Of course the answer depends on the particular configuration, but there is a lot of test harness code and scripts that could be shared (a rough sketch of the kind of thing I mean is appended below the quoted message). And individual findings, even if (especially if) they conflict, are very valuable when well documented.

There is a Benchmark effort covered here, https://amplab.cs.berkeley.edu/benchmark/, but it addresses a slightly different goal. You could consider this Perf-Analysis work part of that, or its own effort. This may be more than you were looking to own, but given your stated enthusiasm :) I want to throw the idea out there.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Sat, Oct 12, 2013 at 1:48 PM, Роман Ткаленко <tkalenkoro...@gmail.com> wrote:

> Hello.
>
> I'm trying to dive into Spark's sources at a deeper-than-mere-glance level,
> and I find that writing unit tests is a good way to start. So, basically,
> I'm wondering if there are points to which I could specifically apply my
> enthusiasm, i.e. are there some uncovered or insufficiently covered parts
> for which I could write tests?
>
> I'm also wondering about the state of the Apache-hosted JIRA for Spark - I
> currently can't see any entries in there. Should I look for them in the
> GitHub mirror, or still in the previous JIRA instance at
> http://spark-project.atlassian.net/?
>
> Regards,
> Roman.
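For concreteness, here is a minimal sketch of the kind of shared timing harness mentioned above, assuming a driver-side Scala program. The object and method names (LoadTimingHarness, timeIt) are made up for this example, and the table-loading step is a placeholder for whatever Shark statement materializes the table in memory in your setup; it only captures driver-JVM CPU time, not executor-side work.

import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

object LoadTimingHarness {
  // HotSpot-specific bean that exposes per-process CPU time in nanoseconds.
  private val osBean =
    ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]

  // Runs `work` once and reports wall-clock time vs. CPU time used by this JVM.
  def timeIt[T](label: String)(work: => T): T = {
    val wallStart = System.nanoTime()
    val cpuStart  = osBean.getProcessCpuTime
    val result    = work
    val wallMs = (System.nanoTime() - wallStart) / 1e6
    val cpuMs  = (osBean.getProcessCpuTime - cpuStart) / 1e6
    println(f"$label%-25s wall=$wallMs%.1f ms  cpu=$cpuMs%.1f ms  cpu/wall=${cpuMs / wallMs}%.2f")
    result
  }
}

// Example use from a driver program or shell session (the body is a placeholder):
//   LoadTimingHarness.timeIt("load table into memory") {
//     // e.g. the Shark query or cache statement that loads the table
//   }
// A cpu/wall ratio near the core count suggests the load is CPU-bound; a ratio well
// below 1 while the disks are busy points at I/O. Executor-side CPU would need to be
// collected separately, e.g. from the Spark UI or OS-level tools such as iostat/vmstat.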