Looks like a whole lot of the results have been analyzed. I suspect there's more than enough to act on already, though I think we should wait until after 2.2 is done. Does anybody have a preference for how to proceed here -- just open a JIRA to take care of a batch of related types of issues and go for it?

On Sat, Jun 17, 2017 at 4:45 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Gentle ping to dev for help. I hope this effort is not abandoned.
>
>
> On 25 May 2017 9:41 am, "Josh Rosen" <joshro...@databricks.com> wrote:
>
> I'm interested in using the Scapegoat
> <https://github.com/sksamuel/scapegoat> Scala compiler plugin to find
> potential bugs and performance problems in Spark. Scapegoat has a useful
> built-in set of inspections and is pretty easy to extend with custom ones.
> For example, I added an inspection to spot places where we call *.apply()* on
> a Seq which is not an IndexedSeq
> <https://github.com/sksamuel/scapegoat/pull/159> in order to make it
> easier to spot potential O(n^2) performance bugs.
>
> There are lots of false-positives and benign warnings (as with any linter
> / static analyzer) so I don't think it's feasible for us to include this as
> a blocking step in our regular build. I am planning to build tooling to
> surface only new warnings so going forward this can become a useful
> code-review aid.
>
> The current codebase has roughly 1700 warnings that I would like to triage
> and categorize as false-positives or real bugs. I can't do this alone, so
> here's how you can help:
>
> - Visit the Google Docs spreadsheet at
> https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing
> and find an un-triaged warning.
> - In the columns at the right of the sheet, enter your name in the
> appropriate column to mark a warning as a false-positive or as a real bug
> and/or performance issue. If you think a warning is a real issue then use the
> "comments" column for providing additional detail.
> - Please don't file JIRAs or PRs for individual warnings; I suspect
> that we'll find clusters of issues which are best fixed in a few larger PRs
> vs. lots of smaller ones. Certain warnings are probably simply style issues
> so we should discuss those before trying to fix them.
>
> The sheet has hidden columns capturing the Spark revision and Scapegoat
> revision. I can use this to programmatically update the sheet and remap
> lines after updating either Scapegoat (to suppress false-positives) or
> Spark (to incorporate fixes and surface new warnings). For those who are
> interested, the sheet was produced with this script:
> https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8
>
> Depending on the results of this experiment we might want to integrate a
> high-signal subset of the Scapegoat warnings into our build. I'm also
> hoping that we'll be able to build a useful corpus of triaged warnings in
> order to help improve Scapegoat itself and eliminate common false-positives.
>
> Thanks and happy bug-hunting,
> Josh Rosen
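
For what it's worth, the "surface only new warnings" idea from the quoted message could look roughly like the sketch below. This is only an illustration under assumptions, not the tooling Josh describes: the `LintWarning` record, its field names, and the `new_warnings` helper are all hypothetical, and it makes no claim about Scapegoat's actual report format. It matches warnings while ignoring line numbers, so code that merely shifts up or down does not resurface an already-triaged warning.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class LintWarning(NamedTuple):
    """One linter finding. All field names here are hypothetical."""
    path: str        # source file the warning points at
    line: int        # line number (shifts as surrounding code is edited)
    inspection: str  # name of the inspection that fired
    message: str     # warning text


def _key(w: LintWarning):
    # Deliberately ignore the line number so that unrelated edits which
    # move code around do not make an old warning look new.
    return (w.path, w.inspection, w.message)


def new_warnings(baseline: Iterable[LintWarning],
                 current: Iterable[LintWarning]) -> List[LintWarning]:
    """Return warnings in `current` not accounted for by `baseline`."""
    # Count baseline warnings per key; duplicates within a file matter.
    remaining = Counter(_key(w) for w in baseline)
    fresh = []
    for w in current:
        k = _key(w)
        if remaining[k] > 0:
            remaining[k] -= 1  # matched an already-known warning
        else:
            fresh.append(w)    # no baseline counterpart left: report it
    return fresh
```

The trade-off in this approach is that a genuinely new warning with the same file, inspection, and message as a fixed one would be masked; remapping lines between revisions, as mentioned in the quoted message, would be more precise.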