Re: [PATCH] [RFC] Delayed parsing for bounds safety attributes

Martin Uecker Sat, 26 Jul 2025 01:05:52 -0700

On Friday, 25.07.2025 at 20:07 -0700, Bill Wendling wrote:
> On Thu, Jul 24, 2025 at 2:53 PM Martin Uecker <ma.uec...@gmail.com> wrote:
> > On Thursday, 24.07.2025 at 14:25 -0700, Bill Wendling wrote:
> > > On Thu, Jul 24, 2025 at 8:03 AM Martin Uecker <ma.uec...@gmail.com> wrote:
> > > > On Thursday, 24.07.2025 at 14:08 +0000, Aaron Ballman wrote:
> > > > > On Wed, Jul 23, 2025 at 8:38 PM Martin Uecker <ma.uec...@gmail.com> wrote:
> > > > TBH, I am not terribly convinced about this argument.
> > > > 
> > > > If I understood it correctly, the late parsing design seems to make
> > > > no distinction as to which identifier is used, the local or
> > > > the global one and just prefers the local one if it exists, possibly
> > > > giving a warning if there is also a global one.
> > > 
> > > Yes...kinda. The order of name lookup would essentially be: field
> > > within a struct, any non-global scope, global scope. (As Kees pointed
> > > out, there would be times when we need to support function calls and
> > > counts in sub-structs, but those are handled by this convention.) The
> > > only part of this ordering that *isn't* part of normal C identifier
> > > resolution is the "field within a struct" part.
> > > 
> > The other big thing where it diverges from normal C identifier lookup
> > (and also C++ for most cases) which you miss in the description above,
> > is that it would pick up an identifier that comes later in the scope.
> > 
> > The following code prints 10 and not 20.  I think this is the much
> > bigger and more severe divergence.
> > 
> > int main()
> > {
> >     int n = 10;
> >     {
> >         printf("%d\n", n);
> >         int n = 20;
> >     }
> > }
> 
> I brought this exact argument up early on in the Bounds Safety RFC
> phase and was assured that none of the developers were confused about
> name resolutions.


There are different opinions about this.  But all I wanted to point
out is that the new lookup rules would go against C's current name
lookup rules.

I will try to explain a bit more why I think this is an issue for C.

Let me start by saying that I totally get that

struct foo {
  char *buf __counted_by(size * 4);
  size_t size;
};

looks nice and is very intuitive by itself.   But the simple examples
are not what worry me.  The situation is entirely different in more
complex examples, e.g. when all of it is expanded from a macro.

#define FOO(element_size)                               \
struct foo {                                            \
  char *buf __counted_by(size * (element_size));        \
        /* a lot of other things */                     \
  size_t size;                                          \
};

Suddenly you have to worry about name collision of anything
contained in the macro argument with something which may occur
inside the structure.
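
To make the collision concrete, here is a minimal sketch.  COUNTED_BY is
a hypothetical stand-in that expands to nothing so the code compiles with
any compiler, and DEFINE_BUF, struct packet, and the two helper functions
are made up for illustration.  The helpers only spell out the two possible
readings of the bound; they do not describe what any compiler does today.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attribute; expands to nothing. */
#define COUNTED_BY(expr)

/* Hypothetical macro: the caller supplies the element size. */
#define DEFINE_BUF(name, element_size)                  \
struct name {                                           \
    char *buf COUNTED_BY(size * (element_size));        \
    size_t size;                                        \
}

enum { size = 16 };          /* a file-scope identifier named 'size' */
DEFINE_BUF(packet, size);    /* the caller means the file-scope 'size' */

/* What the caller intended: member 'size' times the file-scope 'size'. */
size_t intended_bound(const struct packet *p)
{
    return p->size * (size_t)size;
}

/* What member-first lookup would compute: both names bind to the member. */
size_t member_first_bound(const struct packet *p)
{
    return p->size * p->size;
}
```

With p->size == 4, the reading the macro user intended gives 4 * 16 = 64,
while member-first lookup would give 4 * 4 = 16.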

Now, if all you care about is annotating shared C / C++ headers, then you
may not worry about this too much and find that the rules work perfectly
for you.  Which probably explains why all C++ programmers are perfectly
happy with this.

But if you write C code it is an entirely different story, then you
have to worry about this. And such code is common in C.

https://gitlab.com/nbdkit/nbdkit/-/blob/master/common/utils/vector.h


We have the same problem in macros using statement expressions

#define FOO(arg)                \
do {                            \
   int x = (arg);               \
} while (0)

int x;
FOO(x);         // boom.

But in macros you can work around it by renaming the local variables
to something unusual (or even by generating unique identifiers using
__COUNTER__).   How do you do this when annotating an existing API
with counted_by?   You can't.
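
As a rough sketch of that workaround: the macro below takes the name of
its temporary as an extra parameter, and a wrapper generates that name
from __COUNTER__ (a GCC/Clang extension, not ISO C).  CAT, DOUBLE,
DOUBLE_IMPL, and double_via_macro are hypothetical names chosen for this
example.

```c
/* Two-level paste so that __COUNTER__ is expanded before '##' is applied. */
#define CAT_(a, b) a##b
#define CAT(a, b)  CAT_(a, b)

/*
 * The unhygienic form would be
 *     int x = (arg);
 * which turns into "int x = (x);" when the caller passes its own 'x'.
 * Here the temporary's name is a parameter instead.
 */
#define DOUBLE_IMPL(out, arg, tmp)      \
do {                                    \
    int tmp = (arg);                    \
    (out) = 2 * tmp;                    \
} while (0)

/* Each expansion gets a fresh name: double_tmp_0, double_tmp_1, ... */
#define DOUBLE(out, arg) \
    DOUBLE_IMPL(out, arg, CAT(double_tmp_, __COUNTER__))

int double_via_macro(int x)
{
    int out;
    DOUBLE(out, x);   /* passing 'x' is safe: the temporary is not named x */
    return out;
}
```

No such escape hatch exists for a member name in an existing struct,
because that name is part of the API.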


>  I don't know what else to tell you except that
> resolving to the struct first is the exact behavior we've been trying
> to get correct between GCC and Clang. All arguments for making the
> name resolution more explicit have been more-or-less shrugged off. And
> even if we adopted the dot-notation (which I'm not against doing), we
> would *still* need some form of delayed parsing.

Only if we adopted the dot syntax while also allowing arbitrary 
C expressions.

We can also define a sub language for bounds that can be parsed without
knowing the types of the variables.

For example, we could say we allow only expressions of the form

.N + offset

where all constants and variables are always converted to size_t but 
with overflow being a run-time error.    
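
A minimal sketch of what evaluating such a bound could look like, assuming
both operands have already been converted to size_t.  The function name
bound_add and the choice to report the error through a bool return value
are assumptions for illustration only; a real implementation might trap
instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical evaluator for a bound of the form ".N + offset".
 * Returns false instead of wrapping around when the sum does not
 * fit into size_t.
 */
bool bound_add(size_t n, size_t offset, size_t *out)
{
    if (offset > SIZE_MAX - n)
        return false;          /* would wrap: treated as a run-time error */
    *out = n + offset;
    return true;
}
```

For example, bound_add(5, 3, &r) succeeds with r == 8, while
bound_add(SIZE_MAX, 1, &r) reports an error.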

> 
> > > The question about
> > > whether or not this would cause "confusion" to C programmers isn't
> > > completely settled, however Apple says that they have a lot of users
> > > and have yet to run into anyone who was confused by it. While just
> > > anecdotal evidence, it's a good indicator that people would use the
> > > feature "correctly."
> > 
> > I hope I get to see some more information about the context the
> > data Apple has.
> > 
> > But the story of my life at the moment is about disagreeing with
> > C++ programmers who tell me how C is written but who often actually
> > have only very limited experience writing C.  So I am generally
> > sceptical about statements that do not match my experience.
> > (I use size expressions in C in prototypes a *lot*, so all this
> > talk about how this is error prone simply does not match my
> > personal experience.)
> 
> If you look back on all of the discussions, you'll see that I agree with
> this sentiment. I vastly prefer things to be explicit rather than
> implicit.
> 
> > > > I think it is generally a challenge to support.  One could certainly
> > > > store away the tokens and parse them later (this is certainly doable),
> > > > but it adds a lot of issues because you need to add a lot of constraints
> > > > for things which should then not be allowed.  And it is still not an
> > > > acceptable solution for size arguments in C.
> > > > 
> > > > .N would work here if you combine with a rule such as ".N" is always
> > > > converted to "size_t".   Or you require an explicit cast if it is different
> > > > from "size_t".
> > > 
> > > Does this mean that the example above would be treated essentially like:
> > > 
> > >   void func(char *buffer __counted_by((size_t).N * sizeof((size_t).N)), int N);
> > > 
> > > ?
> > 
> > Nobody has devised a full specification for .N yet that would support
> > arbitrary expressions.  There are several possibilities.  Treating it as
> > size_t would be one option, but probably not a very good one.
> > 
> > One option which might be attractive is to treat it as int which would
> > cover all types that have integer promotion, and require the user to add
> > a cast if it comes later and is something else.
> > 
> > One question here is also if you really want an unsigned type when
> > you compute a bound, because you might get wraparound.  This makes
> > me wonder what these expressions would look like in practice.  Maybe
> > you would cast anyway.
> > 
> > The original idea was to first support it only for a single
> > identifier [.N] and maybe just special expressions such as [.N + 3]
> > in which case the type may not be relevant because you want to treat
> > this specially anyway.
> 
> Yes and no. We started with a single identifier, but the idea of using
> expressions more complex than '.N + 3' was always the goal. And, from
> my understanding, Clang does support that. All of the work we've been
> doing has been to support the expression stuff, and that's really my
> main focus for this RFC; specifically expressions within the attribute
> used in structs. Whether this RFC could be used for parameters has yet
> to be seen (I suspect that it could, but would be more invasive).

I meant the original idea in WG14.  We have been discussing these things
for quite a while.  But I agree that we need some expressions in structures.

But the key question is: Do we need to invoke the full language parser?

To me it seems that we would need to restrict this very much, because
I do not think we want to allow evaluation of arbitrary expressions
on each structure access anyhow. 

A small sub language for bounds annotation seems to be an entirely
reasonable approach to me, and we have similar heavily constrained
sublanguages in C already, e.g. for address constants (which might
need to be passed down to the linker).
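
As a rough illustration of how small such a sublanguage can be, the sketch
below recognizes only ".name" and ".name + K" (K a decimal constant)
without consulting any symbol table or type information.  The grammar,
struct bound_expr, and parse_bound are all hypothetical and not a proposal
for the actual syntax.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical parse result for ".name" or ".name + K". */
struct bound_expr {
    char   name[32];   /* member name after the dot */
    size_t offset;     /* constant offset, 0 if absent */
};

static const char *skip_ws(const char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    return s;
}

/* Returns true if 'src' matches the restricted grammar. */
bool parse_bound(const char *src, struct bound_expr *out)
{
    const char *s = skip_ws(src);
    size_t len = 0;
    char *end;
    unsigned long long v;

    if (*s++ != '.')
        return false;                  /* must start with a dot */
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return false;
    while (isalnum((unsigned char)*s) || *s == '_') {
        if (len + 1 >= sizeof out->name)
            return false;              /* name too long for this sketch */
        out->name[len++] = *s++;
    }
    out->name[len] = '\0';
    out->offset = 0;

    s = skip_ws(s);
    if (*s == '\0')
        return true;                   /* plain ".name" */
    if (*s++ != '+')
        return false;                  /* only '+' is allowed */
    s = skip_ws(s);
    if (!isdigit((unsigned char)*s))
        return false;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE || v > SIZE_MAX)
        return false;                  /* constant does not fit size_t */
    out->offset = (size_t)v;
    return *skip_ws(end) == '\0';      /* nothing may follow the constant */
}
```

Here ".N + 3" and ".size" are accepted, while "size * 4" and ".N * 4" are
rejected before any name lookup would have to happen.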

Martin
