Re: [PATCH] libcpp, v4: Add -Winvalid-utf8 warning [PR106655]
On 8/31/22 10:15, Jakub Jelinek wrote:
> On Wed, Aug 31, 2022 at 09:55:29AM -0400, Jason Merrill wrote:
> > On 8/31/22 07:14, Jakub Jelinek wrote:
> > > On Tue, Aug 30, 2022 at 05:51:26PM -0400, Jason Merrill wrote:
> > > > This hunk now seems worth factoring out of the four places it occurs.
> > > >
> > > > It also seems the comment for _cpp_valid_utf8 needs to be updated: it
> > > > currently says it's not called when parsing a string.
> > >
> > > Ok, so like this?
> >
> > OK, thanks.
>
> Actually, it isn't enough to diagnose this in comments and character/string
> literals, sorry for finding that out only today.
>
> We don't accept invalid UTF-8 in identifiers, it fails the checking in there
> (most of the times without errors), what we do is create CPP_OTHER tokens
> out of those and then typically diagnose it when it is used somewhere.
> Except it doesn't have to be used anywhere, it can be omitted.
> So if we have say
> #define I(x)
> I(���)
> like in the Winvalid-utf8-3.c test, we silently accept it.
>
> This updated version extends the diagnostics even to those cases.
> I can't use _cpp_handle_multibyte_utf8 in that case because it needs
> different treatment (no bidi stuff which is emitted already from
> forms_identifier_p etc.).
>
> Tested so far on the new tests, ok for trunk if it passes full
> bootstrap/regtest?

OK.

> 2022-08-31  Jakub Jelinek
>
> 	PR c++/106655
> libcpp/
> 	* include/cpplib.h (struct cpp_options): Implement C++23
> 	P2295R6 - Support for UTF-8 as a portable source file encoding.
> 	Add cpp_warn_invalid_utf8 and cpp_input_charset_explicit fields.
> 	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
> 	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8
> 	and cpp_input_charset_explicit.
> 	* charset.cc (_cpp_valid_utf8): Adjust function comment.
> 	* lex.cc (UCS_LIMIT): Define.
> 	(utf8_continuation): New const variable.
> 	(utf8_signifier): Move earlier in the file.
> 	(_cpp_warn_invalid_utf8, _cpp_handle_multibyte_utf8): New functions.
> 	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
> 	(skip_line_comment): Likewise.
> 	(lex_raw_string, lex_string): Likewise.
> 	(_cpp_lex_direct): Likewise.
> gcc/
> 	* doc/invoke.texi (-Winvalid-utf8): Document it.
> gcc/c-family/
> 	* c.opt (-Winvalid-utf8): New warning.
> 	* c-opts.c (c_common_handle_option) : Set
> 	cpp_opts->cpp_input_charset_explicit.
> 	(c_common_post_options): If -finput-charset=UTF-8 is explicit
> 	in C++23, enable -Winvalid-utf8 by default and if -pedantic
> 	or -pedantic-errors, make it a pedwarn.
> gcc/testsuite/
> 	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
> 	* c-c++-common/cpp/Winvalid-utf8-2.c: New test.
> 	* c-c++-common/cpp/Winvalid-utf8-3.c: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-1.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-2.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-3.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-4.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-5.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-6.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-7.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-8.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-9.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-10.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-11.C: New test.
> 	* g++.dg/cpp23/Winvalid-utf8-12.C: New test.
>
> --- libcpp/include/cpplib.h.jj	2022-08-31 10:19:45.226452609 +0200
> +++ libcpp/include/cpplib.h	2022-08-31 12:25:42.451125755 +0200
> @@ -560,6 +560,13 @@ struct cpp_options
>       cpp_bidirectional_level.  */
>    unsigned char cpp_warn_bidirectional;
>
> +  /* True if libcpp should warn about invalid UTF-8 characters in comments.
> +     2 if it should be a pedwarn.  */
> +  unsigned char cpp_warn_invalid_utf8;
> +
> +  /* True if -finput-charset= option has been used explicitly.  */
> +  bool cpp_input_charset_explicit;
> +
>    /* Dependency generation.  */
>    struct
>    {
> @@ -666,7 +673,8 @@ enum cpp_warning_reason {
>    CPP_W_CXX11_COMPAT,
>    CPP_W_CXX20_COMPAT,
>    CPP_W_EXPANSION_TO_DEFINED,
> -  CPP_W_BIDIRECTIONAL
> +  CPP_W_BIDIRECTIONAL,
> +  CPP_W_INVALID_UTF8
>  };
>
>  /* Callback for header lookup for HEADER, which is the name of a
> --- libcpp/init.cc.jj	2022-08-31 10:19:45.260452148 +0200
> +++ libcpp/init.cc	2022-08-31 12:25:42.451125755 +0200
> @@ -227,6 +227,8 @@ cpp_create_reader (enum c_lang lang, cpp
>    CPP_OPTION (pfile, ext_numeric_literals) = 1;
>    CPP_OPTION (pfile, warn_date_time) = 0;
>    CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
> +  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
> +  CPP_OPTION (pfile, cpp_input_charset_explicit) = 0;
>
>    /* Default CPP arithmetic to something sensible for the host for the
>       benefit of dumb users like fix-header.  */
> --- libcpp/charset.cc.jj	2022-08-26 16:06:10.578493272 +0200
> +++ libcpp/charset.cc	2022-08-31 12:34:18.921176118 +0200
> @@ -1742,9 +1742,9 @@ convert_ucn
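
For readers following the thread, the kind of check that -Winvalid-utf8 relies on can be illustrated with a small standalone sketch. This is not the libcpp code from the patch: first_invalid_utf8 is a hypothetical helper, it returns an offset instead of emitting a diagnostic, and it assumes the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libcpp: return the offset of the first
   byte that does not begin a well-formed UTF-8 sequence, or -1 if the
   whole buffer is well-formed.  */
long
first_invalid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t n;           /* Number of continuation bytes expected.  */
      unsigned long cp;   /* Code point decoded so far.  */
      unsigned long min;  /* Smallest code point valid for this length.  */

      if (c < 0x80)
        {
          i++;            /* Plain ASCII.  */
          continue;
        }
      else if (c >= 0xC2 && c <= 0xDF)
        { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { n = 2; cp = c & 0x0F; min = 0x800; }
      else if (c >= 0xF0 && c <= 0xF4)
        { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return (long) i;  /* Stray continuation, 0xC0/0xC1, or > 0xF4.  */

      if (len - i <= n)
        return (long) i;  /* Sequence truncated by end of buffer.  */

      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return (long) i;  /* Not a 10xxxxxx continuation byte.  */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong encodings, surrogates and values past U+10FFFF.  */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return (long) i;

      i += n + 1;
    }
  return -1;
}
```

A caller could run this over the bytes of a comment, a literal, or a stray token and warn whenever the result is not -1, which also covers input that is lexed but never used afterwards.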
[PATCH] libcpp, v4: Add -Winvalid-utf8 warning [PR106655]
On Wed, Aug 31, 2022 at 09:55:29AM -0400, Jason Merrill wrote:
> On 8/31/22 07:14, Jakub Jelinek wrote:
> > On Tue, Aug 30, 2022 at 05:51:26PM -0400, Jason Merrill wrote:
> > > This hunk now seems worth factoring out of the four places it occurs.
> > >
> > > It also seems the comment for _cpp_valid_utf8 needs to be updated: it
> > > currently says it's not called when parsing a string.
> >
> > Ok, so like this?
>
> OK, thanks.

Actually, it isn't enough to diagnose this in comments and character/string
literals, sorry for finding that out only today.

We don't accept invalid UTF-8 in identifiers, it fails the checking in there
(most of the times without errors), what we do is create CPP_OTHER tokens
out of those and then typically diagnose it when it is used somewhere.
Except it doesn't have to be used anywhere, it can be omitted.
So if we have say
#define I(x)
I(���)
like in the Winvalid-utf8-3.c test, we silently accept it.

This updated version extends the diagnostics even to those cases.
I can't use _cpp_handle_multibyte_utf8 in that case because it needs
different treatment (no bidi stuff which is emitted already from
forms_identifier_p etc.).

Tested so far on the new tests, ok for trunk if it passes full
bootstrap/regtest?

2022-08-31  Jakub Jelinek

	PR c++/106655
libcpp/
	* include/cpplib.h (struct cpp_options): Implement C++23
	P2295R6 - Support for UTF-8 as a portable source file encoding.
	Add cpp_warn_invalid_utf8 and cpp_input_charset_explicit fields.
	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8
	and cpp_input_charset_explicit.
	* charset.cc (_cpp_valid_utf8): Adjust function comment.
	* lex.cc (UCS_LIMIT): Define.
	(utf8_continuation): New const variable.
	(utf8_signifier): Move earlier in the file.
	(_cpp_warn_invalid_utf8, _cpp_handle_multibyte_utf8): New functions.
	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
	(skip_line_comment): Likewise.
	(lex_raw_string, lex_string): Likewise.
	(_cpp_lex_direct): Likewise.
gcc/
	* doc/invoke.texi (-Winvalid-utf8): Document it.
gcc/c-family/
	* c.opt (-Winvalid-utf8): New warning.
	* c-opts.c (c_common_handle_option) : Set
	cpp_opts->cpp_input_charset_explicit.
	(c_common_post_options): If -finput-charset=UTF-8 is explicit
	in C++23, enable -Winvalid-utf8 by default and if -pedantic
	or -pedantic-errors, make it a pedwarn.
gcc/testsuite/
	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
	* c-c++-common/cpp/Winvalid-utf8-2.c: New test.
	* c-c++-common/cpp/Winvalid-utf8-3.c: New test.
	* g++.dg/cpp23/Winvalid-utf8-1.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-2.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-3.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-4.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-5.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-6.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-7.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-8.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-9.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-10.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-11.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-12.C: New test.

--- libcpp/include/cpplib.h.jj	2022-08-31 10:19:45.226452609 +0200
+++ libcpp/include/cpplib.h	2022-08-31 12:25:42.451125755 +0200
@@ -560,6 +560,13 @@ struct cpp_options
      cpp_bidirectional_level.  */
   unsigned char cpp_warn_bidirectional;

+  /* True if libcpp should warn about invalid UTF-8 characters in comments.
+     2 if it should be a pedwarn.  */
+  unsigned char cpp_warn_invalid_utf8;
+
+  /* True if -finput-charset= option has been used explicitly.  */
+  bool cpp_input_charset_explicit;
+
   /* Dependency generation.  */
   struct
   {
@@ -666,7 +673,8 @@ enum cpp_warning_reason {
   CPP_W_CXX11_COMPAT,
   CPP_W_CXX20_COMPAT,
   CPP_W_EXPANSION_TO_DEFINED,
-  CPP_W_BIDIRECTIONAL
+  CPP_W_BIDIRECTIONAL,
+  CPP_W_INVALID_UTF8
 };

 /* Callback for header lookup for HEADER, which is the name of a
--- libcpp/init.cc.jj	2022-08-31 10:19:45.260452148 +0200
+++ libcpp/init.cc	2022-08-31 12:25:42.451125755 +0200
@@ -227,6 +227,8 @@ cpp_create_reader (enum c_lang lang, cpp
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
   CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
+  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
+  CPP_OPTION (pfile, cpp_input_charset_explicit) = 0;

   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header.  */
--- libcpp/charset.cc.jj	2022-08-26 16:06:10.578493272 +0200
+++ libcpp/charset.cc	2022-08-31 12:34:18.921176118 +0200
@@ -1742,9 +1742,9 @@ convert_ucn (cpp_reader *pfile, const uc
 case,