Gentle ping to dev for help. I hope this effort is not abandoned.

On 25 May 2017 9:41 am, "Josh Rosen" <joshro...@databricks.com> wrote:

I'm interested in using the Scapegoat
<https://github.com/sksamuel/scapegoat> Scala compiler plugin to find
potential bugs and performance problems in Spark. Scapegoat has a useful
built-in set of inspections and is pretty easy to extend with custom ones.
For example, I added an inspection to spot places where we call *.apply()* on
a Seq which is not an IndexedSeq
<https://github.com/sksamuel/scapegoat/pull/159> in order to make it easier
to spot potential O(n^2) performance bugs.

There are lots of false positives and benign warnings (as with any linter /
static analyzer), so I don't think it's feasible for us to include this as a
blocking step in our regular build. I am planning to build tooling to
surface only new warnings so that, going forward, this can become a useful
code-review aid.
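
As a rough illustration of what "surface only new warnings" could mean, here
is a minimal sketch. It assumes warnings have already been parsed into
records; the `LintWarning` type, its fields, and the inspection names are
hypothetical and not taken from Scapegoat's report format. Warnings are
matched on file, inspection, and source snippet rather than on line number,
so that unrelated edits which shift code up or down do not make old warnings
look new.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class LintWarning(NamedTuple):
    """Hypothetical record for one linter warning (not Scapegoat's own format)."""
    file: str
    line: int
    inspection: str
    snippet: str


def warning_key(w: LintWarning) -> Tuple[str, str, str]:
    # Deliberately ignore the line number: it changes whenever code above moves.
    return (w.file, w.inspection, w.snippet.strip())


def new_warnings(baseline: List[LintWarning],
                 current: List[LintWarning]) -> List[LintWarning]:
    """Return the warnings in `current` that have no counterpart in `baseline`."""
    # Count baseline keys so that N identical old warnings excuse only N new ones.
    remaining = Counter(warning_key(w) for w in baseline)
    fresh = []
    for w in current:
        k = warning_key(w)
        if remaining[k] > 0:
            remaining[k] -= 1   # already known, consume one baseline match
        else:
            fresh.append(w)     # not seen before, report it
    return fresh


if __name__ == "__main__":
    baseline = [LintWarning("A.scala", 10, "SeqApply", "xs(i)")]
    current = [
        LintWarning("A.scala", 12, "SeqApply", "xs(i)"),   # same warning, moved
        LintWarning("B.scala", 3, "SeqApply", "ys(0)"),    # genuinely new
    ]
    for w in new_warnings(baseline, current):
        print(f"{w.file}:{w.line}: {w.inspection}: {w.snippet}")
```

Matching on the snippet is only one possible design choice; it trades a few
missed duplicates for stability across unrelated edits.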

The current codebase has roughly 1700 warnings that I would like to triage
and categorize as false-positives or real bugs. I can't do this alone, so
here's how you can help:

   - Visit the Google Docs spreadsheet at
   https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing
   and find an un-triaged warning.
   - In the columns at the right of the sheet, enter your name in the
   appropriate column to mark a warning as a false positive or as a real bug
   and/or performance issue. If you think a warning is a real issue, use the
   "comments" column to provide additional detail.
   - Please don't file JIRAs or PRs for individual warnings; I suspect that
   we'll find clusters of issues which are best fixed in a few larger PRs vs.
   lots of smaller ones. Certain warnings are probably simply style issues so
   we should discuss those before trying to fix them.

The sheet has hidden columns capturing the Spark revision and Scapegoat
revision. I can use this to programmatically update the sheet and remap
lines after updating either Scapegoat (to suppress false-positives) or
Spark (to incorporate fixes and surface new warnings). For those who are
interested, the sheet was produced with this script:
https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8
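
For readers wondering what remapping lines between two Spark revisions might
involve, the following is a minimal sketch and not the approach used in the
script linked above. It assumes the old and new contents of a source file are
available as lists of lines, and uses Python's standard `difflib` to find
where an old line ended up. The function name `remap_line` is hypothetical.

```python
import difflib
from typing import List, Optional


def remap_line(old_lines: List[str], new_lines: List[str],
               old_lineno: int) -> Optional[int]:
    """Map a 1-based line number in the old file to the new file.

    Returns None if the line was changed or removed, in which case the
    warning would need to be re-triaged rather than carried over.
    """
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    idx = old_lineno - 1  # difflib works with 0-based indices
    for a_start, b_start, size in matcher.get_matching_blocks():
        if a_start <= idx < a_start + size:
            # The line sits inside an unchanged block: apply the block's offset.
            return b_start + (idx - a_start) + 1
    return None


if __name__ == "__main__":
    old = ["val xs = load()", "xs(i)", "done()"]
    new = ["import foo", "val xs = load()", "xs(i)", "done()"]
    print(remap_line(old, new, 2))
```

A warning whose line maps to `None` is a natural candidate for being dropped
from the sheet or flagged for another look.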

Depending on the results of this experiment we might want to integrate a
high-signal subset of the Scapegoat warnings into our build. I'm also
hoping that we'll be able to build a useful corpus of triaged warnings in
order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen
