On Sat, 2019-11-16 at 21:42 +0100, Thomas Schwinge wrote:
> Hi David!
> 
> On 2019-11-15T20:22:47-0500, David Malcolm <dmalc...@redhat.com>
> wrote:
> > This patch kit
> 
> (I have not looked at the patches.)  ;-)
> 
> > introduces a static analysis pass for GCC that can diagnose
> > various kinds of problems in C code at compile-time (e.g. double-
> > free,
> > use-after-free, etc).
> 
> Sounds great from the description!

Thanks.

> Would it make sense to add to the wiki page
> <https://gcc.gnu.org/wiki/DavidMalcolm/StaticAnalyzer> a (high-level)
> comparison to other static analyzers (Coverity, cppcheck,
> clang-static-analyzer, others?), in terms of how they work, what
> their
> respective benefits are, what their design goals are, etc.  (Of
> course
> understanding that yours is much less mature at this point; talking
> about
> high-level design rather than current implementation status.)
> 
> For example, why do we want that in GCC instead of an external tool
> -- in
> part covered in your Rationale.  Can a compiler-side implementation
> benefit from having more information available than an external tool?
> GCC-side implementation is readily available (modulo GCC plugin
> installation?) vs. external ones need to be installed/set up first.
> GCC-side one only works with GCC-supported languages.  GCC-side one
> analyzes actual code being compiled -- thinking about preprocessor-
> level
> '#if' etc., which surely are problematic for external tools that are
> not
> actually replicating a real build.  And so on.  (If you don't want to
> spell out Coverity, cppcheck, clang-static-analyzer, etc., maybe just
> compare yours to external tools.)
> 
> Just an idea, because I wondered about these things.

Thanks; I've added some notes to the "Rationale" section of the wiki
page.

A lot of the information you're after is hidden in patch 2 of the kit,
in an analysis.texi (though admittedly that's hard to read in "patch
that adds a .texi file" form).

For now, I've uploaded a prebuilt version of the HTML to:

https://dmalcolm.fedorapeople.org/gcc/2019-11-19/gccint/Static-Analyzer.html


> > The analyzer runs as an IPA pass on the gimple SSA representation.
> > It associates state machines with data, with transitions at certain
> > statements and edges.  It finds "interesting" interprocedural paths
> > through the user's code, in which bogus state transitions happen.
> > 
> > For example, given:
> > 
> >    free (ptr);
> >    free (ptr);
> > 
> > at the first call, "ptr" transitions to the "freed" state, and
> > at the second call the analyzer complains, since "ptr" is already
> > in
> > the "freed" state (unless "ptr" is NULL, in which case it stays in
> > the NULL state for both calls).
> > 
> > Specific state machines include:
> > - a checker for malloc/free, for detecting double-free, resource
> > leaks,
> >   use-after-free, etc (sm-malloc.cc), and
> 
> I can immediately see how this can be useful for a bunch of
> 'malloc'/'free'-like etc. OpenACC Runtime Library calls as well as
> source
> code directives.  ..., and this would've flagged existing code in the
> libgomp OpenACC tests, which recently has given me some grief. Short
> summary/examples:
> 
> In addition to host-side 'malloc'/'free', there is device-side
> (separate
> memory space) 'acc_malloc'/'acc_free'. 

I've been thinking about generalizing the malloc/free checker to cover
resource acquisition/release pairs, adding a "domain" for the
allocation, where we'd complain if the resource release function isn't
of the same domain as the resource acquisition function.

Allocation domains might be:
  malloc/free
  C++ scalar new/delete
  C++ array new/delete
  FILE * (fopen/fclose)
  "foo_alloc"/"foo_release" for libfoo (i.e. user-extensible, via
attributes)

and thus catch things like deleting with scalar delete when the buffer
was allocated using new[], and various kinds of layering violations.

I'm afraid that I'm not very familiar with OpenACC.  Would
acc_malloc/acc_free fit into that pattern, or would more be needed? 
For example, can you e.g. dereference a device-side pointer in host
code, or would we ideally issue a diagnostic about that?

>  Static checking example: don't
> mix up host-side and device-side pointers.  (Both are normal C/C++
> pointers.  Hmm, maybe such checking could easily be implemented even
> outside of your checker by annotating the respective function
> declarations with an attribute describing which in/out arguments are
> host-side vs. device-side pointers.)
> 
> Then, there are functions to "map" host-side and device-side memory:
> 'acc_map_data'/'acc_unmap_data'.  Static checking example: you must
> not
> 'acc_free' memory spaces that are still mapped.

It sounds like this state machine is somewhat more complicated.

Is there a state transition diagram for this somewhere?  I don't have
that for my state machines, but there are at least lists of states; see
e.g. the various state_t within malloc_state_machine
near the top of:
https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01539.html

> Then, there are functions like 'acc_create' (or equivalent directives
> like '#pragma acc create') doing both 'acc_malloc', 'acc_map_data'
> (plus/depending on internal reference counting).  Static checking
> example: for a pointer returned by 'acc_create" etc., you must use
> 'acc_delete' etc. instead of 'acc_free', which first does
> 'acc_unmap_data' before interal 'acc_free' (..., and again all that
> depending on reference counting).  (Might be "interesting" to teach
> your
> checker about the reference counting -- if that is actually
> necessary;
> needs further thought.)

Perhaps acc_create/acc_delete could simply be handled as a different
allocation "domain" in the approach I suggested above.

Reference counting is actually how I got into GCC development, by
writing a static analysis pass for verifying reference counts in
CPython extension modules (as part of the GCC Python plugin):
https://gcc-python-plugin.readthedocs.io/en/latest/cpychecker.html
(sadly, that code is slowly bit-rotting).

The approach taken in that python plugin code has various issues. 
Plus, doing it in Python isn't fast, and it's really helpful having
compile-time type checking while writing an analyzer; doing this one in
C++ has been a big improvement.

> > The checker is implemented as a GCC plugin.
> > 
> > The patch kit adds support for "in-tree" plugins i.e. GCC plugins
> > that
> > would live in the GCC source tree and be shipped as part of the GCC
> > tarball,
> > with a new:
> >   --enable-plugins=[LIST OF PLUGIN NAMES]
> > configure option, analogous to --enable-languages (the
> > Makefile/configure
> > machinery for handling in-tree GCC plugins is adapted from how we
> > support
> > frontends).
> 
> I like that.  Implementing this as a plugin surely must help to
> either
> document the GCC plugin interface as powerful/mature for such a
> task.  Or
> make it so, if it isn't yet.  ;-)

Our plugin "interface" as such is very broad.

> > The default is for no such plugins to be enabled, so the default
> > would
> > be that the checker isn't built - you'd have to opt-in to building
> > it,
> > with --enable-plugins=analyzer
> 
> I'd favor a default of '--enable-plugins=default' which enables the
> "usable" plugins.

> > It's not clear to me whether I should focus on:
> > 
> > (a) pruning the scope of the checker so that it works well on
> > *intra*procedural C examples (and bail on anything more complex),
> > perhaps
> > targetting GCC 10 as an optional extra hidden behind
> > --enable-plugins=analyzer, or
> > 
> > (b) work on deeper interprocedural analysis (and fixing efficiency
> > issues
> > with this).
> 
> I personally would be happy to see (a) happen now (without "optional
> extra hidden behind" configure-time flag but rather hidden behind GCC
> command-line flag, '-fanalyze'?), and then (b) later on.  As
> always, doing the incremental thing, (a) first, then (b) later, would
> give it more exposure in the wild, which should help to identify
> design
> issues etc. now instead of much later, for example.

I'm continuing to work on this code; I've already fixed the biggest
issue with LTO support (in my local copy).

I wonder if I ought to hide the interprocedural stuff behind an option
for the initial release, or only handle the simplest wrappers around
malloc/free.  Though obviously I'd prefer to get it working.  I guess
the thing to do for now is to fix bugs, and test it on increasingly
larger codebases.

> > Thoughts?
> 
> One very high-level item: you're using the very generic name
> "analyzer"
> ('-Wanalyzer', 'gcc/analyzer/' filenames, for example, or the '-
> fanalyze'
> I just proposed).  

I like "-fanalyze".

> Might there be potential for confusion (now, or in the
> future) what kind of analyzer that is, or is it safe to assume that
> in
> context of a compiler, an analyzer is always a compile-time, static
> code
> analyzer?  After all, the existing run-time ones are known as
> "checkers":
> '-fstack-check', or "sanitizers": '--fsanitize, or "verifiers":
> '-fvtable-verify'.

My hope is that long-term the analyzer will be extensible: that clang-
static-analyzer works this way, where there's a general framework, with
various "checkers" within it, potentially themselves as plugins.

For an initial version it's simplest to hardcode the checkers, and only
add generality and extension points once we have something concrete and
useful working.

For naming, I've used "analyzer", but I now realize I've conflated two
things:
  (a) the subdirectory we use internally, and the name for the
configure-time enablement option
  (b) what the user types in command-line options (warning names etc)

Potentially (a) could be "sa" (for static analyzer) rather than
"analyzer", but that might be an abbreviation too far.

Hopefully "analyzer" is OK as a name for now.  I plan to create an
ongoing git branch, probably "dmalcolm/analyzer" (I've got some fixes
queued up in my local copy).

> 
> Grüße
>  Thomas

Thanks for the feedback
Dave


Reply via email to