aaron.ballman added a comment.

In D124500#3485924 <https://reviews.llvm.org/D124500#3485924>, @LegalizeAdulthood wrote:
> In D124500#3485462 <https://reviews.llvm.org/D124500#3485462>, @aaron.ballman wrote:
>
>> I largely agree, but I've found cases where we'll convert correct code to incorrect code, so it's a bit stronger than that.
>
> Are you talking generally, or with this check? I can't see how this check is going to generate incorrect code (so far).

Specifically this check.

>> I think that's a reasonable goal, but we're not meeting the "never ever generate invalid code" part.
>
> How so? Can you give an example where this check will produce invalid code?

As posted before:

  #define FINE 1LL << 30LL
  #define BAD 1LL << 31LL
  #define ALSO_BAD 1LL << 32L

Now with godbolt goodness: https://godbolt.org/z/Tzbe8qWT5

================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+    if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+        !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > LegalizeAdulthood wrote:
> > > > > aaron.ballman wrote:
> > > > > > I know this is code moved from elsewhere, but I suppose we never considered the odd edge case where a user does something like `"foo"[0]` as a really awful integer constant. :-D
> > > > > It's always possible to write crazy contorted code and have a check not recognize it. I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > > If I was doing a code review and saw this:
> > > > >
> > > > > ```
> > > > > enum {
> > > > >   FOO = "foo"[0]
> > > > > };
> > > > > ```
> > > > >
> > > > > I'd flag that in a code review as bogus, whereas if I saw:
> > > > >
> > > > > ```
> > > > > enum {
> > > > >   FOO = 'f'
> > > > > };
> > > > > ```
> > > > >
> > > > > That would be acceptable, which is why character literals are accepted as an integral literal initializer for an enum in this check. I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > I think I'm okay agreeing to that in this particular case, but this is more to point out that writing your own parser is a maintenance burden. Users will hit cases we've both forgotten about here, they'll file a bug, then someone has to deal with it. It's very hard to justify to users "we think you write silly code" because they often have creative ways in which their code is not actually so silly, especially when we support "most" valid expressions.
> > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body. (There is a separate discussion going on about improving the preprocessor support and parsing things more deeply, but that isn't even to the point of a prototype yet.) The worst thing we can do is create "fixits" that produce invalid code.
> > >
> > > The worst that happens if your expression isn't recognized is that your macro isn't matched as a candidate for an enum. You can always make it an enum manually and join it with adjacent macros that were recognized and converted.
> > >
> > > As it stands, the check only recognizes a single literal with an optional unary operator.
> > > This change expands the check to recognize a broad range of expressions, allowing those macros to be converted to enums. I opened the issue because running modernize-macro-to-enum on the [[ https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] showed some simple expressions involving literals that weren't recognized and converted.
> > >
> > > If an expression isn't recognized and an issue is opened, it will be an enhancement request to support a broader range of expressions, not a bug that this check created invalid code.
> > >
> > > IMO, the more useful thing that's missing from the grammar is recognizing `sizeof` expressions rather than indexing string literals with an integral literal subscript.
> > >
> > > I had planned on doing another increment to recognize `sizeof` expressions.
> >
> > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body.
> >
> > If you ran the token sequence through clang's parser and got an AST node out, you'd have significantly *more* information as to whether something is a valid enum constant initializer, because you can check that it's an actual constant expression *and* that it's within a valid range of values. This not only fixes edge case bugs with your approach (like the fact that you can generate a series of literal expressions that result in a value too large to store within an enumerator constant), but it enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
> >
> > So I don't agree that it's unavoidable to write another custom parser.
>
> You keep bringing up the idea that the values have to be known, but so far they don't.
>
> Remember, we are replacing macro identifiers with anonymous enum identifiers.
> We aren't specifying a restricting type for the enum, so as long as it's a valid integral literal expression, we're not changing any semantics. Unscoped enums also allow arbitrary conversions to/from an underlying integral type chosen by the compiler.
>
> C++20 9.7.1 paragraph 7 says:
>
> > For an enumeration whose underlying type is not fixed, the underlying type is an integral type that can represent all the enumerator values defined in the enumeration. If no integral type can represent all the enumerator values, the enumeration is ill-formed. It is implementation-defined which integral type is used as the underlying type except that the underlying type shall not be larger than int unless the value of an enumerator cannot fit in an int or unsigned int. If the enumerator-list is empty, the underlying type is as if the enumeration had a single enumerator with value 0.
>
> So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values. Do we actually need to know the values for this check? I don't think so, because we aren't changing anything about the type of the named values. When the compiler evaluates an integral literal, it goes through a similar algorithm assigning the appropriate type to those integral values:
>
> C++20 5.9 paragraph 2:
>
> > A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.
>
> C++20 5.13.2 paragraph 3:
>
> > The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix in which its value can be represented.
> The table says the type is int, unsigned int, long int, unsigned long int, long long int, or unsigned long long int based on the suffix and the value, and that the type is chosen to be big enough to hold the value if the suffix is unspecified.
>
> > but [using `clangParse`] enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
>
> I agree that if we used the full parser, we'd bring in `constexpr` expressions as valid initializers for the enums. However, before engaging in all that work, I'd like to see how likely this is in existing codebases via feedback from users requesting the support. Maybe engaging the parser isn't a big amount of work; I don't actually know. I've never looked deeply at the actual parsing code in clang. Maybe it's easy enough to throw a bag of tokens at it and get back an AST node, maybe not. (I suspect not, based on my experience with the code base so far.)
>
> My suspicion is that code bases that are heavy with macros for constants //aren't// using modern C++ in the body of those macros to define the values of those constants. Certainly this is 100% true for C code that uses macros to define constants, by definition. This check applies equally well to C code, as C has had enums forever, but even recent C code still tends to use macros for constants.
>
> Still, my suspicions aren't data. I'd like to get this check deployed in a basic fashion and let user feedback provide data on what is important.
>
> > So I don't agree that it's unavoidable to write another custom parser.
>
> That's a fair point. //Some// kind of parser is needed to recognize valid initializer expressions or we run the risk of transforming valid code into invalid code. Whether it is a custom recognizer as I've done or `clangParse` is what we're debating here.

> You keep bringing up the idea that the values have to be known, but so far they don't.
See comments at the top level.

> So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values. Do we actually need to know the values for this check?

Yes, C requires the enumeration constants to be representable with `int`. But also, because this is in the `modernize` module, it's very likely we'll be getting a request to convert to using a scoped enumeration or an enumeration with the appropriate fixed underlying type in C++ as well.


CHANGES SINCE LAST ACTION

  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits