tahonermann added inline comments.
================ Comment at: clang/include/clang/AST/Expr.h:1846-1871 + std::string getStringAsChar() const { + assert(getCharByteWidth() == 1 && + "This function is used in places that assume strings use char"); + return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() + getByteLength()); + } + + std::u16string getStringAsChar16() const { ---------------- MarcusJohnson91 wrote: > cor3ntin wrote: > > aaron.ballman wrote: > > > One potential issue to this is that in C++, these types are defined to be > > > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, > > > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but > > > we may need to add some sanity checks to catch if a target overrides that > > > behavior. Currently, all the targets in Clang are keeping that promise, > > > but I have no idea what shenanigans downstream users get up to and > > > whether their targets remove the macro definition or define it to `0` > > > instead of `1`. > > Is it possible that the host and target wchar_t have a different size here? > I've honestly been wondering how Clang handled that, in the codebase vs at > runtime myself for a while. WG14 N2728 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm) proposes specifying that `char16_t` and `char32_t` string literals are UTF-16 and UTF-32 encoded respectively. SG16 previously tried to identify any implementations that use a different encoding for these and was unable to find one. I think it is safe to only support these encodings for `u` and `U` prefixed string literals. ================ Comment at: clang/lib/AST/ScanfFormatString.cpp:344-347 + case LengthModifier::AsUTF16: + return ArgType(ArgType::Char16Ty); + case LengthModifier::AsUTF32: + return ArgType(ArgType::Char32Ty); ---------------- Should these additions not include a `PtrTo` wrapper as is done for other types? ================ Comment at: clang/lib/AST/Type.cpp:1980-2002 +bool Type::isChar16Type(const LangOptions &LangOpts) const { if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType)) - return BT->getKind() == BuiltinType::Char16; - return false; -} - -bool Type::isChar32Type() const { - if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType)) - return BT->getKind() == BuiltinType::Char32; - return false; -} - -/// Determine whether this type is any of the built-in character -/// types. -bool Type::isAnyCharacterType() const { - const auto *BT = dyn_cast<BuiltinType>(CanonicalType); - if (!BT) return false; - switch (BT->getKind()) { - default: return false; - case BuiltinType::Char_U: - case BuiltinType::UChar: - case BuiltinType::WChar_U: - case BuiltinType::Char8: - case BuiltinType::Char16: - case BuiltinType::Char32: - case BuiltinType::Char_S: - case BuiltinType::SChar: - case BuiltinType::WChar_S: - return true; + if (BT->getKind() == BuiltinType::Char16) + return true; + if (!LangOpts.CPlusPlus) { + return isType("char16_t"); } ---------------- MarcusJohnson91 wrote: > aaron.ballman wrote: > > MarcusJohnson91 wrote: > > > aaron.ballman wrote: > > > > If I understand properly, one issue is that `char16_t` is defined by C > > > > to be *the same type* as `uint_least16_t`, which itself is a typedef to > > > > some other integer type. So we can't make `char16_t` be a distinct type > > > > in C, it has to be a typedef to a typedef to an integer type. > > > > > > > > One concern I have about the approach in this patch is that it makes > > > > the type system a bit more convoluted. This type is *not* really a > > > > character type in C, it's a typedef type that playacts as a character > > > > type. I think this is a pretty fundamental change to the type system in > > > > the compiler in some ways because this query is no longer about the > > > > canonical type but about the semantics of how the type is expected to > > > > be used. > > > > > > > > I'd definitely like to hear what @rsmith thinks of this approach. > > > I see your point, but I'm not sure how else we could get it through the > > > type checking system without doing it this way? > > > > > > I tried to be careful to only allow the actual character typedefs through > > > by making sure char16_t or char32_t is in the typedef chain, and only in > > > C mode. > > > I see your point, but I'm not sure how else we could get it through the > > > type checking system without doing it this way? > > > > I am thinking that rather than doing this work as part of the `Type` > > object, the extra work can be done from the format string checker. This > > keeps the confusion about types localized to the part of the code that > > needs to care about it, rather than pushing it onto all consumers of the > > type system. > Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on > it for a bit. What does the format string checker do for an example like `printf("%Ls", L"text")` today? ================ Comment at: clang/lib/Sema/SemaChecking.cpp:9534 S, inFunctionCall, Args[format_idx], - S.PDiag(diag::warn_format_string_is_wide_literal), FExpr->getBeginLoc(), + S.PDiag(diag::err_format_invalid_type), FExpr->getBeginLoc(), /*IsStringLocation*/ true, OrigFormatExpr->getSourceRange()); ---------------- Perhaps Clang should warn about use of the new extension in strict conformance modes. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D103426/new/ https://reviews.llvm.org/D103426 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits