Re: GSoC 2026: Extend the static analysis pass

Ridham Khurana via Gcc Wed, 04 Mar 2026 02:13:00 -0800

Hi Dave,

As I was going through both files, I noticed that in the file
*gimple-ssa-sprintf.cc*, directive struct bundles a tree arg directly and
in the file *c-format.cc*, argument parser takes tree first_fillin_param.
But in analyzer part, arguments are const svalue*(via
*call_details::get_arg_svalue()*) and not just tree. So, this means the
shared layer should only handle format string parsing and leave the
argument binding to each corresponding subsystem respectively, which I
think was already implying by your iterator sketch.


I made some rough sketch of the API:

struct format_string_action {
   enum kind { LITERAL, DIRECTIVE } kind; // specifying LITERAL or
DIRECTIVE
   offset, length; // specifying exact position in the raw format string

   // For directives:
   specifier;      // 'd', 's', 'f', etc
   length_mod;     // 'l', 'h', 'L', etc
   flags;          // '-', '+', ' ', '#', '0'
   width;          // value, or can be -1 if unspecified, or '*'
   precision;      // value, or can be -1 if unspecified, or '*'
};

class format_string_iterator{
public:
   bool done_p() const; // check if completely parsed
   void next(); // advance one
   format_string_action get_next_action() const; // just lookup for the
next , not advancing to the next
   format_string_action get_current() const; // getting the current action
from action-list

private:
   const char* raw_string; // original format string is stored here
   const char* pos; // current position in raw_string
   format_string_action latest; // latest parsed in action-list
};

Also, one question about this format_string_action, should it have some
variable containing the number of arguments that are consumed by the
directive? Example, %*.*d consumes 3 (width,precision,value) while %d
consumes only 1. So, should this be stored in the shared layer or the
subsystems should handle it in their own files, when seeing width='*' and
precision='*'?

Best,
Ridham Khurana

On Fri, Feb 20, 2026 at 2:21 AM David Malcolm <[email protected]> wrote:

> On Thu, 2026-02-19 at 16:50 +0530, Ridham Khurana wrote:
> > Hi Dave
>
> Hi Ridham.
>
> Thanks for taking this discussion to the mailing list.
>
> For reference of other people, this is the GSoC proposal relating to PR
> 107017 (adding format string support to -fanalyzer).
>
> > Thanks for the clarifications and the detailed pointers. Here is my
> > understanding of the design architecture and progress so far:
> >
> > I have been going through both files c-format.cc and gimple-ssa-
> > sprintf.cc,
> > and I think I understand the structure/architecture you are
> > suggesting. So,
> > the idea is to introduce a format_parser class (that you mentioned in
> > last
> > email), that takes the raw format string as input and breaks it into
> > different but sequential components, for example literal text
> > segments or
> > directives like %d or %f. I have been calling this sequential
> > structure the
> > "action-map", just to address it with a name. The main point of this
> > shared
> > approach is to avoid having the frontend, GIMPLE and
> > analyzer(upcoming) to
> > run their own logic to decode it, but use this common parser, and
> > instead
> > have them all start from the same parsed action-map.
>
> (nods)
>
> Maybe an "action_list" rather than "action_map", given that it has a
> sequence?  (but that risks a "what color shall we paint the bikeshed"
> discussion)
>
> > The
> > "format_string_iterator" will then provide us an easy and consistent
> > way
> > for different files to walk through the action-map, at the same time
> > make
> > sure the internals of the action-map are not exposed.
>
> (nods)
>
> > From this, every
> > subsystem that requires going through the action-map can use it with
> > its
> > own logic as required, like argument checking in frontend, sprintf
> > folding
> > in GIMPLE, and mainly modelling memory-effects in analyzer, while
> > keeping
> > all the existing semantic algorithms in their own respective files.
>
> A simple way of doing this might be for the shared code to generate a
> std::list<format_action> (or a std::vector), which would be your
> action-map, but another approach might be to eschew an object for the
> action-map, and instead have format_string_iterator provide "actions"
> on demand as clients iterate through the actions, something like:
>
> class format_string_iterator
> {
> public:
>    bool done_p () const;
>    void next ();
>    format_string_action get_next_action () const;
>
> private:
>    /* implementation-specific details here */
> };
>
> or somesuch.
>
> I'm not sure which approach is better.
>
> IIRC, the current implementations in both c-format.cc and gimple-ssa-
> sprintf.cc are structured around iterations, rather than reusing an
> existing container and populating/iterating over that.
>
> >
> > One more thing that I was thinking is, where should the shared
> > format_parser and format_string_iterator classes live in the tree?
> > Right
> > now, c-format.cc is under gcc/c-family/, gimple-ssa-sprintf.cc is
> > directly
> > under gcc, and the analyzer is under gcc/analyzer/. Since the shared
> > code
> > needs to be accessible for all three, I am guessing it should go
> > directly
> > under gcc, but I wanted to confirm if you have some other approach to
> > deal
> > with this before I start thinking about the layout.
>
> I think directly under the gcc/ directory, as part of gcc/Makefile.in's
> OBJs.
>
> Maybe we should introduce a gcc::format_string namespace for the shared
> code to live in???
>
> One thing to remember is that patches need to be reviewable - it's good
> to avoid a huge patch that deletes a big chunk of code and replaces it
> with a different other chunk of code.  So it might be best to build
> this up as a series of patches, expressing refactorings in c-format.cc
> that eventually move the new classes elsewhere (given that IIRC c-
> format.cc currently got the most complicated logic).  But maybe we
> don't need to worry about that during prototyping/experimentation.
>
> A lot of the complexity in c-format.cc is that if we emit a warning, we
> have to do some work to generate source-location information for the
> parts of the format string of interest, but we don't want to do that
> work unless it's going to be needed (e.g. finding the exact location of
> a "%i").  So we'll want the action-map to keep that deferral of work,
> so that we only do it if we issue a diagnostic relating to part of the
> format string.
>
> > On the region_model vs sm_state_map distinction, I think I follow
> > what you
> > mean. I am looking at how both appear in the .dot dumps (e.g. the
> > rmodel:
> > part showing memory state, and the malloc: part showing checker
> > states like
> > unchecked → freed). I am still going through the callbacks to
> > understand
> > how the two sides connect, but I will keep this in mind as I continue
> > studying more about the analyzer internals.
>
> (nods)
>
> >
> > My next step is to continue my deep dive into both implementations
> > and
> > start planning out how the shared parser API could look, depending on
> > where
> > it will live.
>
> Sounds great.
>
> Let me know if you have any questions.
>
> Thanks
> Dave
>
>

Re: GSoC 2026: Extend the static analysis pass

Reply via email to