aaron.ballman added a comment.

In D124500#3485924 <https://reviews.llvm.org/D124500#3485924>, @LegalizeAdulthood wrote:
> In D124500#3485462 <https://reviews.llvm.org/D124500#3485462>, @aaron.ballman wrote:
>
>> I largely agree, but I've found cases where we'll convert correct code to incorrect code, so it's a bit stronger than that.
>
> Are you talking generally, or with this check? I can't see how this check is going to generate incorrect code (so far).

Specifically this check.

>> I think that's a reasonable goal, but we're not meeting the "never ever generate invalid code" part.
>
> How so? Can you give an example where this check will produce invalid code?

As posted before:

  #define FINE 1LL << 30LL
  #define BAD 1LL << 31LL
  #define ALSO_BAD 1LL << 32L

Now with godbolt goodness: https://godbolt.org/z/Tzbe8qWT5

================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+    if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+        !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > LegalizeAdulthood wrote:
> > > > > aaron.ballman wrote:
> > > > > > I know this is code moved from elsewhere, but I suppose we never considered the odd edge case where a user does something like `"foo"[0]` as a really awful integer constant. :-D
> > > > > It's always possible to write crazy contorted code and have a check not recognize it. I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > > If I was doing a code review and saw this:
> > > > >
> > > > > ```
> > > > > enum {
> > > > >   FOO = "foo"[0]
> > > > > };
> > > > > ```
> > > > >
> > > > > I'd flag that in a code review as bogus, whereas if I saw:
> > > > >
> > > > > ```
> > > > > enum {
> > > > >   FOO = 'f'
> > > > > };
> > > > > ```
> > > > >
> > > > > That would be acceptable, which is why character literals are accepted as an integral literal initializer for an enum in this check. I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > I think I'm okay agreeing to that in this particular case, but this is more to point out that writing your own parser is a maintenance burden. Users will hit cases we've both forgotten about here, they'll file a bug, then someone has to deal with it. It's very hard to justify to users "we think you write silly code" because they often have creative ways in which their code is not actually so silly, especially when we support "most" valid expressions.
> > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body. (There is a separate discussion going on about improving the preprocessor support and parsing things more deeply, but that isn't even to the point of a prototype yet.) The worst thing we can do is create "fixits" that produce invalid code.
> > >
> > > The worst that happens if your expression isn't recognized is that your macro isn't matched as a candidate for an enum. You can always make it an enum manually and join it with adjacent macros that were recognized and converted.
> > >
> > > As it stands, the check only recognizes a single literal with an optional unary operator.
> > > This change expands the check to recognize a broad range of expressions, allowing those macros to be converted to enums. I opened the issue because running modernize-macro-to-enum on the [[ https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] showed some simple expressions involving literals that weren't recognized and converted.
> > >
> > > If an expression isn't recognized and an issue is opened, it will be an enhancement request to support a broader range of expressions, not a bug that this check created invalid code.
> > >
> > > IMO, the more useful thing that's missing from the grammar is recognizing `sizeof` expressions rather than indexing string literals with an integral literal subscript.
> > >
> > > I had planned on doing another increment to recognize `sizeof` expressions.
> >
> > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body.
> >
> > If you ran the token sequence through clang's parser and got an AST node out, you'd have significantly *more* information as to whether something is a valid enum constant initializer, because you can check that it's an actual constant expression *and* that it's within a valid range of values. This not only fixes edge case bugs with your approach (like the fact that you can generate a series of literal expressions that result in a value too large to store within an enumerator constant), but it enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
> >
> > So I don't agree that it's unavoidable to write another custom parser.
>
> You keep bringing up the idea that the values have to be known, but so far they don't.
>
> Remember, we are replacing macro identifiers with anonymous enum identifiers.
> We aren't specifying a restricting type for the enum, so as long as it's a valid integral literal expression, we're not changing any semantics. Unscoped enums also allow arbitrary conversions to/from an underlying integral type chosen by the compiler.
>
> C++20 9.7.1 paragraph 7 says:
>
> > For an enumeration whose underlying type is not fixed, the underlying type is an integral type that can represent all the enumerator values defined in the enumeration. If no integral type can represent all the enumerator values, the enumeration is ill-formed. It is implementation-defined which integral type is used as the underlying type except that the underlying type shall not be larger than int unless the value of an enumerator cannot fit in an int or unsigned int. If the enumerator-list is empty, the underlying type is as if the enumeration had a single enumerator with value 0.
>
> So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values. Do we actually need to know the values for this check? I don't think so, because we aren't changing anything about the type of the named values. When the compiler evaluates an integral literal, it goes through a similar algorithm assigning the appropriate type to those integral values:
>
> C++20 5.9 paragraph 2:
>
> > A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.
>
> C++20 5.13.2 paragraph 3:
>
> > The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix in which its value can be represented.
> The table says the type is int, unsigned int, long int, unsigned long int, long long int, or unsigned long long int based on the suffix and the value, and that the type is chosen to be big enough to hold the value if the suffix is unspecified.
>
> > but [using `clangParse`] enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
>
> I agree that if we used the full parser, we'd bring in `constexpr` expressions as valid initializers for the enums. However, before engaging in all that work, I'd like to see how likely this is in existing codebases via feedback from users requesting the support. Maybe engaging the parser isn't a big amount of work; I don't actually know. I've never looked deeply at the actual parsing code in clang. Maybe it's easy enough to throw a bag of tokens at it and get back an AST node, maybe not. (I suspect not, based on my experience with the code base so far.)
>
> My suspicion is that code bases that are heavy with macros for constants //aren't// using modern C++ in the body of those macros to define the values of those constants. Certainly this is 100% true for C code that uses macros to define constants, by definition. This check applies equally well to C code, as C has had enums forever, but even recent C code still tends to use macros for constants.
>
> Still, my suspicions aren't data. I'd like to get this check deployed in a basic fashion and let user feedback provide data on what is important.
>
> > So I don't agree that it's unavoidable to write another custom parser.
>
> That's a fair point. //Some// kind of parser is needed to recognize valid initializer expressions or we run the risk of transforming valid code into invalid code. Whether it is a custom recognizer as I've done or `clangParse` is what we're debating here.

> You keep bringing up the idea that the values have to be known, but so far they don't.
See comments at the top level.

> So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values. Do we actually need to know the values for this check?

Yes, C requires the enumeration constants to be representable with `int`. But also, because this is in the `modernize` module, it's very likely we'll be getting a request to convert to using a scoped enumeration or an enumeration with the appropriate fixed underlying type in C++ as well.


CHANGES SINCE LAST ACTION

  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits