Re: [PATCH] libcpp, v4: Add -Winvalid-utf8 warning [PR106655]

2022-08-31 Thread Jason Merrill via Gcc-patches

On 8/31/22 10:15, Jakub Jelinek wrote:
> On Wed, Aug 31, 2022 at 09:55:29AM -0400, Jason Merrill wrote:
> > On 8/31/22 07:14, Jakub Jelinek wrote:
> > > On Tue, Aug 30, 2022 at 05:51:26PM -0400, Jason Merrill wrote:
> > > > This hunk now seems worth factoring out of the four places it occurs.
> > > >
> > > > It also seems the comment for _cpp_valid_utf8 needs to be updated: it
> > > > currently says it's not called when parsing a string.
> > >
> > > Ok, so like this?
> >
> > OK, thanks.
>
> Actually, it isn't enough to diagnose this in comments and character/string
> literals, sorry for finding that out only today.
>
> We don't accept invalid UTF-8 in identifiers: it fails the checking there
> (most of the time without errors).  Instead we create CPP_OTHER tokens
> out of those bytes and typically diagnose them when they are used somewhere.
> Except such a token doesn't have to be used anywhere; it can be dropped.
>
> So if we have say
> #define I(x)
> I(���)
> like in the Winvalid-utf8-3.c test, we silently accept it.
>
> This updated version extends the diagnostics even to those cases.
> I can't use _cpp_handle_multibyte_utf8 in that case because it needs
> different treatment (no bidi stuff which is emitted already from
> forms_identifier_p etc.).
>
> Tested so far on the new tests, ok for trunk if it passes full
> bootstrap/regtest?

OK.



[PATCH] libcpp, v4: Add -Winvalid-utf8 warning [PR106655]

2022-08-31 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 31, 2022 at 09:55:29AM -0400, Jason Merrill wrote:
> On 8/31/22 07:14, Jakub Jelinek wrote:
> > On Tue, Aug 30, 2022 at 05:51:26PM -0400, Jason Merrill wrote:
> > > This hunk now seems worth factoring out of the four places it occurs.
> > > 
> > > It also seems the comment for _cpp_valid_utf8 needs to be updated: it
> > > currently says it's not called when parsing a string.
> > 
> > Ok, so like this?
> 
> OK, thanks.

Actually, it isn't enough to diagnose this in comments and character/string
literals, sorry for finding that out only today.

We don't accept invalid UTF-8 in identifiers: it fails the checking there
(most of the time without errors).  Instead we create CPP_OTHER tokens
out of those bytes and typically diagnose them when they are used somewhere.
Except such a token doesn't have to be used anywhere; it can be dropped.

So if we have say
#define I(x)
I(���)
like in the Winvalid-utf8-3.c test, we silently accept it.

This updated version extends the diagnostics even to those cases.
I can't use _cpp_handle_multibyte_utf8 in that case because it needs
different treatment (no bidi stuff which is emitted already from
forms_identifier_p etc.).

Tested so far on the new tests, ok for trunk if it passes full
bootstrap/regtest?

2022-08-31  Jakub Jelinek  

PR c++/106655
libcpp/
* include/cpplib.h (struct cpp_options): Implement C++23
P2295R6 - Support for UTF-8 as a portable source file encoding.
Add cpp_warn_invalid_utf8 and cpp_input_charset_explicit fields.
(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8
and cpp_input_charset_explicit.
* charset.cc (_cpp_valid_utf8): Adjust function comment.
* lex.cc (UCS_LIMIT): Define.
(utf8_continuation): New const variable.
(utf8_signifier): Move earlier in the file.
(_cpp_warn_invalid_utf8, _cpp_handle_multibyte_utf8): New functions.
(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
(skip_line_comment): Likewise.
(lex_raw_string, lex_string): Likewise.
(_cpp_lex_direct): Likewise.
gcc/
* doc/invoke.texi (-Winvalid-utf8): Document it.
gcc/c-family/
* c.opt (-Winvalid-utf8): New warning.
* c-opts.c (c_common_handle_option) :
Set cpp_opts->cpp_input_charset_explicit.
(c_common_post_options): If -finput-charset=UTF-8 is explicit
in C++23, enable -Winvalid-utf8 by default and if -pedantic
or -pedantic-errors, make it a pedwarn.
gcc/testsuite/
* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
* c-c++-common/cpp/Winvalid-utf8-2.c: New test.
* c-c++-common/cpp/Winvalid-utf8-3.c: New test.
* g++.dg/cpp23/Winvalid-utf8-1.C: New test.
* g++.dg/cpp23/Winvalid-utf8-2.C: New test.
* g++.dg/cpp23/Winvalid-utf8-3.C: New test.
* g++.dg/cpp23/Winvalid-utf8-4.C: New test.
* g++.dg/cpp23/Winvalid-utf8-5.C: New test.
* g++.dg/cpp23/Winvalid-utf8-6.C: New test.
* g++.dg/cpp23/Winvalid-utf8-7.C: New test.
* g++.dg/cpp23/Winvalid-utf8-8.C: New test.
* g++.dg/cpp23/Winvalid-utf8-9.C: New test.
* g++.dg/cpp23/Winvalid-utf8-10.C: New test.
* g++.dg/cpp23/Winvalid-utf8-11.C: New test.
* g++.dg/cpp23/Winvalid-utf8-12.C: New test.

--- libcpp/include/cpplib.h.jj  2022-08-31 10:19:45.226452609 +0200
+++ libcpp/include/cpplib.h 2022-08-31 12:25:42.451125755 +0200
@@ -560,6 +560,13 @@ struct cpp_options
      cpp_bidirectional_level.  */
   unsigned char cpp_warn_bidirectional;
 
+  /* True if libcpp should warn about invalid UTF-8 characters in comments.
+     2 if it should be a pedwarn.  */
+  unsigned char cpp_warn_invalid_utf8;
+
+  /* True if -finput-charset= option has been used explicitly.  */
+  bool cpp_input_charset_explicit;
+
   /* Dependency generation.  */
   struct
   {
@@ -666,7 +673,8 @@ enum cpp_warning_reason {
   CPP_W_CXX11_COMPAT,
   CPP_W_CXX20_COMPAT,
   CPP_W_EXPANSION_TO_DEFINED,
-  CPP_W_BIDIRECTIONAL
+  CPP_W_BIDIRECTIONAL,
+  CPP_W_INVALID_UTF8
 };
 
 /* Callback for header lookup for HEADER, which is the name of a
--- libcpp/init.cc.jj   2022-08-31 10:19:45.260452148 +0200
+++ libcpp/init.cc  2022-08-31 12:25:42.451125755 +0200
@@ -227,6 +227,8 @@ cpp_create_reader (enum c_lang lang, cpp
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
   CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
+  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
+  CPP_OPTION (pfile, cpp_input_charset_explicit) = 0;
 
   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header.  */
--- libcpp/charset.cc.jj   2022-08-26 16:06:10.578493272 +0200
+++ libcpp/charset.cc   2022-08-31 12:34:18.921176118 +0200
@@ -1742,9 +1742,9 @@ convert_ucn (cpp_reader *pfile, const uc
 case,