Roman, an area that I think (a) would have high impact and (b) is relatively not well covered is performance analysis. I'm sure most teams are doing this internally at their respective companies, but there is no shared code base or shared body of wisdom about what we're finding and improving.
For example, consider the task of loading a table from disk into memory in Shark. We're getting conflicting data about how much of this is CPU-bound vs. I/O-bound. Our effort to track this down should be shareable somehow, and would benefit from others' findings. Of course the answer depends on the particular configuration, but there is a lot of test harness code and scripts that could be shared (a rough sketch of the kind of thing I mean is appended below the quoted message). And individual findings, even if (especially if) they conflict, are very valuable when well documented.

There is a Benchmark effort covered here, https://amplab.cs.berkeley.edu/benchmark/, but it addresses a slightly different goal. You could consider this Perf-Analysis work part of that, or its own effort. This may be more than you were looking to own, but given your stated enthusiasm :) I want to throw the idea out there.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Sat, Oct 12, 2013 at 1:48 PM, Роман Ткаленко <tkalenkoro...@gmail.com> wrote:

> Hello.
>
> I'm trying to dive into Spark's sources at a deeper-than-mere-glance level,
> and I find that writing unit tests is a good way to start. So, basically,
> I'm wondering if there are points to which I could specifically apply my
> enthusiasm, i.e. are there some uncovered or insufficiently covered parts
> for which I could write tests?
>
> I'm also wondering about the state of the Apache-hosted JIRA for Spark - I
> currently can't see any entries in there. Should I look for them in the
> GitHub mirror, or still in the previous JIRA instance at
> http://spark-project.atlassian.net/?
>
> Regards,
> Roman.
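For concreteness, here is a minimal sketch of the kind of shared timing harness mentioned above, assuming a driver-side Scala program. The object and method names (LoadTimingHarness, timeIt) are made up for this example, and the table-loading step is a placeholder for whatever Shark statement materializes the table in memory in your setup; it only captures driver-JVM CPU time, not executor-side work.

import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

object LoadTimingHarness {
  // HotSpot-specific bean that exposes per-process CPU time in nanoseconds.
  private val osBean =
    ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]

  // Runs `work` once and reports wall-clock time vs. CPU time used by this JVM.
  def timeIt[T](label: String)(work: => T): T = {
    val wallStart = System.nanoTime()
    val cpuStart  = osBean.getProcessCpuTime
    val result    = work
    val wallMs = (System.nanoTime() - wallStart) / 1e6
    val cpuMs  = (osBean.getProcessCpuTime - cpuStart) / 1e6
    println(f"$label%-25s wall=$wallMs%.1f ms  cpu=$cpuMs%.1f ms  cpu/wall=${cpuMs / wallMs}%.2f")
    result
  }
}

// Example use from a driver program or shell session (the body is a placeholder):
//   LoadTimingHarness.timeIt("load table into memory") {
//     // e.g. the Shark query or cache statement that loads the table
//   }
// A cpu/wall ratio near the core count suggests the load is CPU-bound; a ratio well
// below 1 while the disks are busy points at I/O. Executor-side CPU would need to be
// collected separately, e.g. from the Spark UI or OS-level tools such as iostat/vmstat.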