Re: [PATCH 00/22] RFC: integrated 3rd-party static analysis support

Martin Sebor Sun, 06 Aug 2017 14:22:23 -0700

On 08/04/2017 04:04 PM, David Malcolm wrote:

This patch kit clearly isn't ready yet as-is (see e.g. the
"known unknowns" below), but I'm posting it now in the hope of
getting early feedback.


Summary
=======

This patch kit provides an easy way to make integrate 3rd-party static
analysis tools into gcc, and have them:
(a) report through gcc's diagnostic subsystem, and
(b) "watermark" the generated binaries with queryable data on what checkers
    were run, and what the results were.


I couldn't agree more that static analysis is often done too late,
and that running it before code is checked in is ideal.  At my last
job, running static analysis was required as part of the commit
criteria for every change.  The tooling made sure that each change
was analyzed and defect-free before it was committed and would
reject changes that failed this test.  (A full run on the whole
product would happen on a much less frequent schedule.)

As for this particular project, I'm somewhat divided.  On the one
hand, implementing these kinds of enhancements is an opportunity
to clean up and generalize the existing (internal) design, and
expose latent bugs in the process.  On the other, adding all this
machinery complicates the already complex code base.  Unless it's
fully exercised as part of everyday GCC development it also runs
the risk of bit rotting and further increasing GCC maintenance.

That said, GCC having its own static analyzer (or some other such
tool) that were used as the default implementation with this
feature to fully exercise it during bootstrap would largely
obviate this concern.  It would still require maintenance but
at least it would be fully tested.

As for adding own implementation of JSON (and other components),
to minimize the maintenance costs, I would far prefer to introduce
a dependency on a well-tested and actively maintained third party
library.

Martin


Here's an example showing gcc running a bank of 3rd-party checkers on this
source file:

  #include <stdlib.h>

  void test ()
  {
    void *ptr_1;
    void *ptr_2;

    ptr_1 = malloc (64);
    if (!ptr_1)
      return;
    ptr_2 = malloc (64);
    if (!ptr_2)
      return;

    free (ptr_2);
    free (ptr_1);
  }

via a simple command-line:

  $ ./xgcc -B. -c conditional-leak.c -Wrun-analyzers=policy.json
  conditional-leak.c:13:5: error: Potential leak of memory pointed to by 
'ptr_1' [clang-analyzer:Memory leak]
       return;
       ^
  conditional-leak.c:8:11: note: state 1 of 4: Memory is allocated
     ptr_1 = malloc (64);
             ^
  conditional-leak.c:9:7: note: state 2 of 4: Assuming 'ptr_1' is non-null
     if (!ptr_1)
         ^
  conditional-leak.c:12:7: note: state 3 of 4: Assuming 'ptr_2' is null
     if (!ptr_2)
         ^
  conditional-leak.c:13:5: note: state 4 of 4: Potential leak of memory pointed 
to by 'ptr_1'
       return;
       ^
  conditional-leak.c:13:0: error: Memory leak: ptr_1 [cppcheck:memleak]
       return;

Of the checkers, clang's static analyzer and cppcheck both identify the
memory leak; the former also identifies the control flow (the other
checkers didn't report anything).

The idea is to provide a mechanism to make it easy for developers and
projects to impose policy on what checkers should be run, and to gate
the build if certain tests fail.

In this case, the results are treated as hard errors and block the build,
but policy could allow them to be warnings.

Extensive metadata is captured about what checkers were run, and what
they emitted, using the "Firehose" interchange format:

  http://firehose.readthedocs.io/en/latest/index.html

In the case where this doesn't block the build, this can be queried via a
  contrib/get-static-analysis.py
script, so e.g. you can verify that a setuid binary was indeed compiled
using all the checkers that you expect it to be.

This can also be used to embed data about the code into the watermark.
For example, checkers/ianal.py embeds information about "Copyright"
lines in the source code into the generated binaries, from where it
can be queried (this example is intended as a proof-of-concept rather
than as a real license-tracking solution...)


Statement of the problem
========================

Static analysis is IMHO done too late, if at all: static analysis tools are run
as an optional extra, "on the side", rather than in developers' normal
workflow, with some kind of "override the compiler and do extra work" hook,
which may preclude running more than one analyzer at once.  Analysis results
are reviewed (if at all) in some kind of on-the-side tool, rather than when the
code is being edited, or patches being prepared.

It would be better to have an easy way for developers to run analyzer(s)
as they're doing development, as part of their edit-compile-test cycle
- analysis problems are reported immediately, and can be acted on
immediately (e.g. by treating some checker tests as being hard errors).

It would also be good to have a way to run analyzer(s) when packages are
built, with a variety of precanned policies for analyzers.  For example,
setuid binaries and network-facing daemons could each be built with a
higher strictness of checking.

It would also be good to tag binaries with information on what analyzers
were run, what options they were invoked with, etc.
Potentially have "dump_file" information from optimization passes stored
in the metadata also.   Have a tool to query all of this.

This way a distribution can perform a query like:

  "show me all setuid binaries that contain code that wasn't checked
   with $CHECKER with $TEST set to be a hard error"

Can/should we break the build if there are issues?
Yes: but have a way to opt-in easily: if the tool is well-integrated with the
    compiler: e.g.
        -Wrun-analyzers=/usr/share/analyzers/userspace/network-facing-service
then upstream developers and packagers can turn on the setting, and see what
breaks, and fix it naturally within an compile-edit-test cycle

This gives a relatively painless way to opt-in to increasing levels of
strictness (e.g. by an upstream project, or by an individual developer).

Does this slow the build down?
Yes: but you can choose which analyzers run, and can choose to turn them off.
It ought to parallelize well.  I believe users will prefer to turn them on,
and have builders burn up the extra CPU cycles.
This may make much more sense for binary distributions (e.g. Fedora, Debian)
that it does for things like Gentoo.

Example policy files/options might be:
  -Wrun-analyzers=/usr/share/analyzers/userspace/network-facing-service
  -Wrun-analyzers=/usr/share/analyzers/userspace/application
  -Wrun-analyzers=/usr/share/analyzers/userspace/setuid-binary
  -Wrun-analyzers=/usr/share/analyzers/userspace/default
  -Wrun-analyzers=/usr/share/analyzers/kernel

or whatnot.

Idea is to provide mechanism, and for the distribution to decide on some
standard policies.

This may also allow us to sandbox a gcc plugin by running the plugin inside
another cc1, for plugins that add warnings - if the plugin ICEs, then the main
cc1 isn't affected (useful for doing mass rebuilds of code using an
experimental plugin).


Known unknowns
==============

How does one suppress a specific false-positive site?
Do we need a pragma for it?  (though pragmas ought to already affect some of
the underlying checkers...)

Do we really want .json for the policy format?
If we're expecting users to edit this, we need great error messages,
and probably support for comments.  Would YAML or somesuch be better?
Or have them as individual command-line flags, and the policy files are
"@" files for gcc.

How to mark which checkers are appropriate for which languages?

(etc; see also all the FIXMEs in the code...)


Dependencies
============

The "checkers" subdirectory uses Python 2 or 3, and has a few Python
dependencies, including "firehose" and "gccinvocation".


How it works
============

If enabled, toplev.c starts each of the various checkers from separate
threads from near the start of toplev.c, so that the checkers run in
parallel with each other, and with the bulk of cc1.  Near the end of
toplev.c it waits for each thread to finish, and reads the stdout,
which is expected to be in Firehose JSON format.  This is then sent
through the diagnostic subsystem.

Each "checker" is a harness script, which "knows" how to invoke
the particular 3rd-party tool, and coerce the output from the tool
into the common JSON format.

Some notes on the data model can be seen here:
  http://firehose.readthedocs.io/en/latest/data-model.html
(though that's expressed as Python objects and XML, rather than
the JSON format).


Successfully bootstrapped&regrtested the combination of the patches
on x86_64-pc-linux-gnu (though the only testcases are selftest based
unit-tests, rather than DejaGnu tests).


Thoughts?
Dave


David Malcolm (22):
  Expose assert_loceq outside of input.c; add ASSERT_LOCEQ
  libcpp: add linemap_position_for_file_line_and_column
  Add JSON implementation
  Add firehose.h/cc
  diagnostic.c/h: add support for external tools
  Makefile.in: hack in -lpthread
  Add minimal version of Nick Clifton's annobin code
  Add GNU_BUILD_ATTRIBUTE_STATIC_ANALYSIS to annobin.h
  Add selftest::read_file (..., FILE *, ...)
  Add checkers.h/cc
  Add checkers/test-sources
  Add -Wrun-analyzers= to common.opt, toplev.c, and invoke.texi
  Add checkers/checker.py
  Add checkers/always_fails.py
  Add checkers/clang_analyzer.py
  Add checkers/coverity.py
  Add checkers/cppcheck.py
  Add checkers/flawfinder.py
  Add checkers/ianal.py
  Add checkers/splint.py
  Add checkers/Makefile
  Add contrib/get-static-analysis.py

 checkers/ChangeLog                                 |    9 +
 checkers/Makefile                                  |   23 +
 checkers/always_fails.py                           |   57 +
 checkers/checker.py                                |  367 ++++
 checkers/clang_analyzer.py                         |  145 ++
 checkers/coverity.py                               |  141 ++
 checkers/cppcheck.py                               |  138 ++
 checkers/flawfinder.py                             |  124 ++
 checkers/ianal.py                                  |   79 +
 checkers/splint.py                                 |   77 +
 checkers/test-sources/conditional-leak.c           |   17 +
 checkers/test-sources/cpychecker-demo.c            |  110 ++
 checkers/test-sources/divide-by-zero.c             |    4 +
 checkers/test-sources/harmless.c                   |    9 +
 checkers/test-sources/multiple-1.c                 |    6 +
 checkers/test-sources/multiple-2.c                 |    9 +
 checkers/test-sources/out-of-bounds.c              |    6 +
 checkers/test-sources/read-through-null.c          |    4 +
 checkers/test-sources/return-of-stack-address.c    |    6 +
 checkers/test-sources/unconditional-file-leak.c    |   10 +
 contrib/get-static-analysis.py                     |   47 +
 gcc/Makefile.in                                    |    7 +-
 gcc/annobin.cc                                     |  185 ++
 gcc/annobin.h                                      |   45 +
 gcc/checkers.cc                                    |  736 ++++++++
 gcc/checkers.h                                     |   26 +
 gcc/common.opt                                     |    4 +
 gcc/diagnostic-show-locus.c                        |   29 +-
 gcc/diagnostic.c                                   |   85 +-
 gcc/diagnostic.h                                   |    5 +
 gcc/doc/invoke.texi                                |    8 +-
 gcc/firehose.cc                                    |  709 ++++++++
 gcc/firehose.h                                     |  199 ++
 gcc/input.c                                        |   71 +-
 gcc/json.cc                                        | 1914 ++++++++++++++++++++
 gcc/json.h                                         |  214 +++
 gcc/selftest-diagnostic.h                          |   62 +
 gcc/selftest-input.h                               |   54 +
 gcc/selftest-run-tests.c                           |    3 +
 gcc/selftest.c                                     |   16 +-
 gcc/selftest.h                                     |   10 +
 .../checker-output/test-clang-analyzer.json        |  122 ++
 .../selftests/checker-output/test-cppcheck.json    |   50 +
 .../selftests/checker-output/test-failure.json     |   38 +
 .../selftests/checker-policy/test-policy.json      |    7 +
 gcc/toplev.c                                       |    9 +
 libcpp/include/line-map.h                          |    9 +
 libcpp/line-map.c                                  |   51 +
 48 files changed, 6001 insertions(+), 55 deletions(-)
 create mode 100644 checkers/ChangeLog
 create mode 100644 checkers/Makefile
 create mode 100755 checkers/always_fails.py
 create mode 100755 checkers/checker.py
 create mode 100755 checkers/clang_analyzer.py
 create mode 100644 checkers/coverity.py
 create mode 100755 checkers/cppcheck.py
 create mode 100755 checkers/flawfinder.py
 create mode 100755 checkers/ianal.py
 create mode 100755 checkers/splint.py
 create mode 100644 checkers/test-sources/conditional-leak.c
 create mode 100644 checkers/test-sources/cpychecker-demo.c
 create mode 100644 checkers/test-sources/divide-by-zero.c
 create mode 100644 checkers/test-sources/harmless.c
 create mode 100644 checkers/test-sources/multiple-1.c
 create mode 100644 checkers/test-sources/multiple-2.c
 create mode 100644 checkers/test-sources/out-of-bounds.c
 create mode 100644 checkers/test-sources/read-through-null.c
 create mode 100644 checkers/test-sources/return-of-stack-address.c
 create mode 100644 checkers/test-sources/unconditional-file-leak.c
 create mode 100644 contrib/get-static-analysis.py
 create mode 100644 gcc/annobin.cc
 create mode 100644 gcc/annobin.h
 create mode 100644 gcc/checkers.cc
 create mode 100644 gcc/checkers.h
 create mode 100644 gcc/firehose.cc
 create mode 100644 gcc/firehose.h
 create mode 100644 gcc/json.cc
 create mode 100644 gcc/json.h
 create mode 100644 gcc/selftest-diagnostic.h
 create mode 100644 gcc/selftest-input.h
 create mode 100644 
gcc/testsuite/selftests/checker-output/test-clang-analyzer.json
 create mode 100644 gcc/testsuite/selftests/checker-output/test-cppcheck.json
 create mode 100644 gcc/testsuite/selftests/checker-output/test-failure.json
 create mode 100644 gcc/testsuite/selftests/checker-policy/test-policy.json

Re: [PATCH 00/22] RFC: integrated 3rd-party static analysis support

Reply via email to