[PATCH v2] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

2023-03-21 Thread Dimitrij Mijoski via Gcc-patches
This patch fixes the handling of surrogate code points in all standard
facets for transcoding Unicode that are based on std::codecvt. Surrogate
code points should always be treated as an error. Surrogate code units,
on the other hand, can appear only in UTF-16 and only as part of a
proper pair.
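
As a rough illustration of the rule above (a sketch only, not the code
from codecvt.cc; the helper names is_surrogate and
utf8_pair_encodes_surrogate are made up for this example): a code point
in the range [U+D800, U+DFFF] is rejected, and in UTF-8 such a value can
be recognized from the first two bytes alone, because it would have to
start with 0xED followed by a continuation byte >= 0xA0.

```cpp
// Hypothetical helper: true if c is a surrogate code point.
constexpr bool is_surrogate(char32_t c)
{ return 0xD800 <= c && c <= 0xDFFF; }

// Hypothetical helper: true if c1, c2 are the first two bytes of a
// 3-byte UTF-8 sequence that could only encode a surrogate.
// U+D800 would be ED A0 80 and U+DFFF would be ED BF BF.
constexpr bool utf8_pair_encodes_surrogate(unsigned char c1, unsigned char c2)
{ return c1 == 0xED && (c2 & 0xC0) == 0x80 && c2 >= 0xA0; }

// Compile-time checks of the boundaries.
static_assert(is_surrogate(0xD800) && is_surrogate(0xDFFF), "");
static_assert(!is_surrogate(0xD7FF) && !is_surrogate(0xE000), "");
```

Checking the second byte before asking for the third one lets a decoder
report an error for a truncated sequence such as ED A0 instead of
asking for more input.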

Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
number of bytes was given in the range [from, from_end), error was
always returned. The last byte in such a range does not form a full
UTF-16 code unit, so no decision about an error can be made; partial
should be returned instead.
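
The intended behaviour can be sketched with a small standalone decoder.
This is an assumption-laden illustration, not the libstdc++
implementation: the names status and ucs2_from_utf16be are hypothetical,
and only big-endian input without a BOM is handled.

```cpp
#include <cstddef>
#include <vector>

enum class status { ok, partial, error };

// Hypothetical sketch: decode big-endian UTF-16 bytes into UCS-2.
// `consumed` receives the number of bytes that were converted.
status ucs2_from_utf16be(const unsigned char* from, std::size_t nbytes,
                         std::vector<char16_t>& to, std::size_t& consumed)
{
  consumed = 0;
  while (nbytes - consumed >= 2)
    {
      const char16_t c = static_cast<char16_t>((from[consumed] << 8)
                                               | from[consumed + 1]);
      if (0xD800 <= c && c <= 0xDFFF)
        return status::error; // surrogates are not valid in UCS-2
      to.push_back(c);
      consumed += 2;
    }
  // A single leftover byte is not a whole code unit, so there is not
  // enough information to call it an error: report partial instead.
  return consumed == nbytes ? status::ok : status::partial;
}
```

With the three bytes 00 62 04 this sketch converts 'b', consumes two
bytes and reports partial, which mirrors what the patch makes
std::codecvt_utf16::in() do for a trailing odd byte.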

The testsuite for testing these facets was updated in the following
order:

1. All functions that test codecvts that work with UTF-8 were refactored
   and made more generic so that they also accept a codecvt that works
   with the char type char8_t.
2. The same functions were updated with new test cases for transcoding
   errors and now additionally test for surrogates, overlong UTF-8
   sequences, code points out of the Unicode range, and more tests for
   missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants,
   UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.

libstdc++-v3/ChangeLog:

* src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
surrogates in UTF-8.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
(ucs4_in): Fix handling of range with odd number of bytes.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
(ucs2_in): Fix handling of range with odd number of bytes.
(__codecvt_utf16_base<char16_t>::do_in): Likewise.
(__codecvt_utf16_base<char32_t>::do_in): Likewise.
(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
testing functions for char8_t, add more test cases for errors,
add testing functions for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
Renames, add tests for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
Fix test.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.
---
 libstdc++-v3/src/c++11/codecvt.cc |   18 +-
 .../22_locale/codecvt/codecvt_unicode.cc  |   38 +-
 .../22_locale/codecvt/codecvt_unicode.h   | 1799 +
 .../codecvt/codecvt_unicode_char8_t.cc|   53 +
 .../codecvt/codecvt_unicode_wchar_t.cc|   32 +-
 .../22_locale/codecvt/codecvt_utf16/79980.cc  |2 +-
 6 files changed, 1493 insertions(+), 449 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 02f05752d..2cc812cfc 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -284,6 +284,8 @@ namespace
return invalid_mb_sequence;
   if (c1 == 0xE0 && c2 < 0xA0) [[unlikely]] // overlong
return invalid_mb_sequence;
+  if (c1 == 0xED && c2 >= 0xA0) [[unlikely]] // surrogate
+   return invalid_mb_sequence;
   if (avail < 3) [[unlikely]]
return incomplete_mb_character;
   char32_t c3 = (unsigned char) from[2];
@@ -484,6 +486,8 @@ namespace
 while (from.size())
   {
const char32_t c = from[0];
+   if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+ return codecvt_base::error;
if (c > maxcode) [[unlikely]]
  return codecvt_base::error;
if (!write_utf8_code_point(to, c)) [[unlikely]]
@@ -508,7 +512,7 @@ namespace
  return codecvt_base::error;
to = codepoint;
   }
-return from.size() ? codecvt_base::partial : codecvt_base::ok;
+return from.nbytes() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // ucs4 -> utf16
@@ -521,6 +525,8 @@ namespace
 while (from.size())
   {
const char32_t c = from[0];
+   if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+ return codecvt_base::error;
if (c > maxcode) [[unlikely]]
  return codecvt_base::error;
if (!write_utf16_code_point(to, c, mode)) [[unlikely]]
@@ -653,7 +659,7 @@ namespace
 while (from.size() && to.size())
   {
char16_t c = from[0];
-   if (is_high_surrogate(c))
+   if (0xD800 <= c && c <= 0xDFFF)
  return codecvt_base::error;
if (c > maxcode)
  return codecvt_base::error;
@@ -680,7 +686,7 @@ namespace
  return codecvt_base::error;
to = c;
   }
-return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
+return from.nbytes() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   const char16_t*
@@ -1344,8 +1350,6 @@ 

Re: [PATCH] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

2023-03-20 Thread Dimitrij Mijoski via Gcc-patches
On Mon, 2023-03-20 at 15:21 +0000, Jonathan Wakely wrote:
> 
> Thanks, the patch looks OK to my uninformed eye, but I'm seeing a new
> regression:
> 
> /home/jwakely/src/gcc/gcc/libstdc++-
> v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc:86: void
> test06(): Assertion 'result == u"from_bytes failed"' failed.
> FAIL: 22_locale/codecvt/codecvt_utf16/79980.cc execution test

Most likely this regression appears because of the change to the case
where an odd number of bytes is given to std::codecvt_utf16::in().
The old test 79980.cc:86: void test06() is probably wrong and should
be changed.


> Also, I see that libc++ fails some of your new tests the same way as
> current libstdc++ does:
> 
> unicode: /home/jwakely/src/gcc/gcc/libstdc++-
> v3/testsuite/22_locale/codecvt/codecvt_unicode.h:298: void
> utf8_to_utf32_in_error(const std::codecvt<InternT, ExternT,
> mbstate_t> &) [InternT = char32_t, ExternT = char]: Assertion `res ==
> cvt.error' failed.
> Aborted (core dumped)
> 
> Does that mean they have the same problem? Or is the test wrong? Or
> is your patch implementing something that contradicts the
> requirements of the standard? I think it's that libc++ has the same
> handling of surrogates, but I'd like to be sure that's right.

See bug https://github.com/llvm/llvm-project/issues/60177 . It can be
reproduced with the codecvt_unicode testsuite without this patch; it is
not related to surrogates. GCC had that bug too, but I already fixed it
with my previous big patch on the codecvts.


[PATCH] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

2023-03-08 Thread Dimitrij Mijoski via Gcc-patches
This patch fixes the handling of surrogate code points in all standard
facets for transcoding Unicode that are based on std::codecvt. Surrogate
code points should always be treated as an error. Surrogate code units,
on the other hand, can appear only in UTF-16 and only as part of a
proper pair.

Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
number of bytes was given in the range [from, from_end), error was
always returned. The last byte in such a range does not form a full
UTF-16 code unit, so no decision about an error can be made; partial
should be returned instead.

The testsuite for testing these facets was updated in the following
order:

1. All functions that test codecvts that work with UTF-8 were refactored
   and made more generic so they accept codecvt that works with the char
   type char8_t.
2. The same functions were updated with new test cases for transcoding
   errors and now additionally test for surrogates, overlong UTF-8
   sequences, code points out of the Unicode range, and more tests for
   missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants,
   UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.

libstdc++-v3/ChangeLog:

* src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
surrogates in UTF-8.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
(ucs4_in): Fix handling of range with odd number of bytes.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
(ucs2_in): Fix handling of range with odd number of bytes.
(__codecvt_utf16_base<char16_t>::do_in): Likewise.
(__codecvt_utf16_base<char32_t>::do_in): Likewise.
(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
testing functions for char8_t, add more test cases for errors,
add testing functions for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
Renames, add tests for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.
---
 libstdc++-v3/src/c++11/codecvt.cc |   18 +-
 .../22_locale/codecvt/codecvt_unicode.cc  |   38 +-
 .../22_locale/codecvt/codecvt_unicode.h   | 1799 +
 .../codecvt/codecvt_unicode_char8_t.cc|   53 +
 .../codecvt/codecvt_unicode_wchar_t.cc|   32 +-
 5 files changed, 1492 insertions(+), 448 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 02f05752d..2cc812cfc 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -284,6 +284,8 @@ namespace
return invalid_mb_sequence;
   if (c1 == 0xE0 && c2 < 0xA0) [[unlikely]] // overlong
return invalid_mb_sequence;
+  if (c1 == 0xED && c2 >= 0xA0) [[unlikely]] // surrogate
+   return invalid_mb_sequence;
   if (avail < 3) [[unlikely]]
return incomplete_mb_character;
   char32_t c3 = (unsigned char) from[2];
@@ -484,6 +486,8 @@ namespace
 while (from.size())
   {
const char32_t c = from[0];
+   if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+ return codecvt_base::error;
if (c > maxcode) [[unlikely]]
  return codecvt_base::error;
if (!write_utf8_code_point(to, c)) [[unlikely]]
@@ -508,7 +512,7 @@ namespace
  return codecvt_base::error;
to = codepoint;
   }
-return from.size() ? codecvt_base::partial : codecvt_base::ok;
+return from.nbytes() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // ucs4 -> utf16
@@ -521,6 +525,8 @@ namespace
 while (from.size())
   {
const char32_t c = from[0];
+   if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+ return codecvt_base::error;
if (c > maxcode) [[unlikely]]
  return codecvt_base::error;
if (!write_utf16_code_point(to, c, mode)) [[unlikely]]
@@ -653,7 +659,7 @@ namespace
 while (from.size() && to.size())
   {
char16_t c = from[0];
-   if (is_high_surrogate(c))
+   if (0xD800 <= c && c <= 0xDFFF)
  return codecvt_base::error;
if (c > maxcode)
  return codecvt_base::error;
@@ -680,7 +686,7 @@ namespace
  return codecvt_base::error;
to = c;
   }
-return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
+return from.nbytes() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   const char16_t*
@@ -1344,8 +1350,6 @@ do_in(state_type&, const extern_type* __from, const 
extern_type* __from_end,
   auto res = ucs2_in(from, to, _M_maxcode, _M_mode);
   __from_next = 

[PATCH] libstdc++: testsuite: Add char8_t to codecvt_unicode

2023-02-08 Thread Dimitrij Mijoski via Gcc-patches
libstdc++-v3/ChangeLog:

* testsuite/22_locale/codecvt/codecvt_unicode.cc: Rename
  functions.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Make more
  generic so it accepts char8_t.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: Rename
  functions.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.
---
 .../22_locale/codecvt/codecvt_unicode.cc  |  16 +-
 .../22_locale/codecvt/codecvt_unicode.h   | 807 +-
 .../codecvt/codecvt_unicode_char8_t.cc|  53 ++
 .../codecvt/codecvt_unicode_wchar_t.cc|   6 +-
 4 files changed, 484 insertions(+), 398 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc

diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
index df1a2b4cc..eafb53a8c 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
@@ -27,38 +27,38 @@ void
 test_utf8_utf32_codecvts ()
 {
   using codecvt_c32 = codecvt<char32_t, char, mbstate_t>;
-  auto loc_c = locale::classic ();
+  auto &loc_c = locale::classic ();
   VERIFY (has_facet<codecvt_c32> (loc_c));
 
   auto &cvt = use_facet<codecvt_c32> (loc_c);
-  test_utf8_utf32_codecvts (cvt);
+  test_utf8_utf32_cvt (cvt);
 
   codecvt_utf8<char32_t> cvt2;
-  test_utf8_utf32_codecvts (cvt2);
+  test_utf8_utf32_cvt (cvt2);
 }
 
 void
 test_utf8_utf16_codecvts ()
 {
   using codecvt_c16 = codecvt<char16_t, char, mbstate_t>;
-  auto loc_c = locale::classic ();
+  auto &loc_c = locale::classic ();
   VERIFY (has_facet<codecvt_c16> (loc_c));
 
   auto &cvt = use_facet<codecvt_c16> (loc_c);
-  test_utf8_utf16_cvts (cvt);
+  test_utf8_utf16_cvt (cvt);
 
   codecvt_utf8_utf16<char16_t> cvt2;
-  test_utf8_utf16_cvts (cvt2);
+  test_utf8_utf16_cvt (cvt2);
 
   codecvt_utf8_utf16<char32_t> cvt3;
-  test_utf8_utf16_cvts (cvt3);
+  test_utf8_utf16_cvt (cvt3);
 }
 
 void
 test_utf8_ucs2_codecvts ()
 {
   codecvt_utf8<char16_t> cvt;
-  test_utf8_ucs2_cvts (cvt);
+  test_utf8_ucs2_cvt (cvt);
 }
 
 int
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
index fbdc7a35b..690c07215 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
@@ -42,33 +42,33 @@ auto constexpr array_size (const T (&)[N]) -> size_t
   return N;
 }
 
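-template <typename CharT>
+template <typename InternT, typename ExternT>
 void
-utf8_to_utf32_in_ok (const std::codecvt<CharT, char, mbstate_t> &cvt)
+utf8_to_utf32_in_ok (const std::codecvt<InternT, ExternT, mbstate_t> &cvt)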
 {
   using namespace std;
   // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP
-  const char in[] = "bш\u\U0010";
-  const char32_t exp_literal[] = U"bш\u\U0010";
-  CharT exp[array_size (exp_literal)] = {};
-  std::copy (begin (exp_literal), end (exp_literal), begin (exp));
-
-  static_assert (array_size (in) == 11, "");
-  static_assert (array_size (exp_literal) == 5, "");
-  static_assert (array_size (exp) == 5, "");
-  VERIFY (char_traits<char>::length (in) == 10);
-  VERIFY (char_traits<char32_t>::length (exp_literal) == 4);
-  VERIFY (char_traits<CharT>::length (exp) == 4);
+  const unsigned char input[] = "bш\u\U0010";
+  const char32_t expected[] = U"bш\u\U0010";
+  static_assert (array_size (input) == 11, "");
+  static_assert (array_size (expected) == 5, "");
+
+  ExternT in[array_size (input)];
+  InternT exp[array_size (expected)];
+  copy (begin (input), end (input), begin (in));
+  copy (begin (expected), end (expected), begin (exp));
+  VERIFY (char_traits<ExternT>::length (in) == 10);
+  VERIFY (char_traits<InternT>::length (exp) == 4);
 
   test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {3, 2}, {6, 3}, {10, 4}};
   for (auto t : offsets)
 {
-  CharT out[array_size (exp) - 1] = {};
+  InternT out[array_size (exp) - 1] = {};
   VERIFY (t.in_size <= array_size (in));
   VERIFY (t.out_size <= array_size (out));
   auto state = mbstate_t{};
-  auto in_next = (const char *) nullptr;
-  auto out_next = (CharT *) nullptr;
+  auto in_next = (const ExternT *) nullptr;
+  auto out_next = (InternT *) nullptr;
   auto res = codecvt_base::result ();
 
   res = cvt.in (state, in, in + t.in_size, in_next, out, out + t.out_size,
@@ -76,19 +76,19 @@ utf8_to_utf32_in_ok (const std::codecvt )
   VERIFY (res == cvt.ok);
   VERIFY (in_next == in + t.in_size);
   VERIFY (out_next == out + t.out_size);
-  VERIFY (char_traits<CharT>::compare (out, exp, t.out_size) == 0);
+  VERIFY (char_traits<InternT>::compare (out, exp, t.out_size) == 0);
   if (t.out_size < array_size (out))
VERIFY (out[t.out_size] == 0);
 }
 
   for (auto t : offsets)
 {
-  CharT out[array_size (exp)] = {};
+  InternT out[array_size (exp)] = {};
   VERIFY (t.in_size <= array_size (in));
   VERIFY (t.out_size <= array_size (out));
   auto state = mbstate_t{};
-  auto in_next = (const char *) nullptr;
-  auto out_next = (CharT *) nullptr;
+  

Re: [PATCH] libstdc++: testsuite: Simplify codecvt_unicode

2023-01-18 Thread Dimitrij Mijoski via Gcc-patches
On Wed, 2023-01-18 at 18:53 +0000, Jonathan Wakely wrote:
> This doesn't compile in C++11 or C++14, because there's no guaranteed
> elision.

I see. I just looked it up in the docs and found that I need to put
--target_board=unix/-std=c++11 inside RUNTESTFLAGS to test in C++11
mode.
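
For context, here is a minimal sketch of the problem Jonathan points
out. It assumes a hypothetical type facet_like that stands in for the
codecvt facets, which cannot be copied because std::locale::facet has a
deleted copy constructor; the function name use_facet_like is also made
up for this example.

```cpp
// Hypothetical stand-in for a facet: copying is disabled, and no move
// constructor is generated because the copy operations are user-declared.
struct facet_like
{
  facet_like() = default;
  facet_like(const facet_like&) = delete;
  facet_like& operator=(const facet_like&) = delete;
  int id() const { return 42; }
};

int use_facet_like()
{
  // auto f = facet_like();  // ill-formed in C++11 and C++14: needs an
  //                         // accessible copy or move constructor;
  //                         // accepted since C++17 (guaranteed elision)
  facet_like f;              // accepted in every standard mode
  return f.id();
}
```

Declaring the object directly, without the "auto x = T ();" form, avoids
the dependency on guaranteed elision.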


[PATCH] libstdc++: testsuite: Simplify codecvt_unicode

2023-01-17 Thread Dimitrij Mijoski via Gcc-patches
Stop using unique_ptr, create some objects directly.

libstdc++-v3/ChangeLog:

* testsuite/22_locale/codecvt/codecvt_unicode.cc: Simplify.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Simplify.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: Simplify.
---
 .../22_locale/codecvt/codecvt_unicode.cc   | 18 ++
 .../22_locale/codecvt/codecvt_unicode.h|  9 +
 .../codecvt/codecvt_unicode_wchar_t.cc | 12 ++--
 3 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
index ae4b6c896..3d7393e4a 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
@@ -29,11 +29,12 @@ test_utf8_utf32_codecvts ()
   using codecvt_c32 = codecvt<char32_t, char, mbstate_t>;
   auto loc_c = locale::classic ();
   VERIFY (has_facet<codecvt_c32> (loc_c));
+
   auto &cvt = use_facet<codecvt_c32> (loc_c);
   test_utf8_utf32_codecvts (cvt);
 
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8<char32_t> ());
-  test_utf8_utf32_codecvts (*cvt_ptr);
+  auto cvt2 = codecvt_utf8<char32_t> ();
+  test_utf8_utf32_codecvts (cvt2);
 }
 
 void
@@ -42,21 +43,22 @@ test_utf8_utf16_codecvts ()
   using codecvt_c16 = codecvt<char16_t, char, mbstate_t>;
   auto loc_c = locale::classic ();
   VERIFY (has_facet<codecvt_c16> (loc_c));
+
   auto &cvt = use_facet<codecvt_c16> (loc_c);
   test_utf8_utf16_cvts (cvt);
 
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8_utf16<char16_t> ());
-  test_utf8_utf16_cvts (*cvt_ptr);
+  auto cvt2 = codecvt_utf8_utf16<char16_t> ();
+  test_utf8_utf16_cvts (cvt2);
 
-  auto cvt_ptr2 = to_unique_ptr (new codecvt_utf8_utf16<char32_t> ());
-  test_utf8_utf16_cvts (*cvt_ptr2);
+  auto cvt3 = codecvt_utf8_utf16<char32_t> ();
+  test_utf8_utf16_cvts (cvt3);
 }
 
 void
 test_utf8_ucs2_codecvts ()
 {
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8<char16_t> ());
-  test_utf8_ucs2_cvts (*cvt_ptr);
+  auto cvt = codecvt_utf8<char16_t> ();
+  test_utf8_ucs2_cvts (cvt);
 }
 
 int
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
index 99d1a4684..fbdc7a35b 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
@@ -15,18 +15,11 @@
 // with this library; see the file COPYING3.  If not see
 // <http://www.gnu.org/licenses/>.
 
+#include 
 #include 
 #include 
-#include 
 #include 
 
-template <typename T>
-std::unique_ptr<T>
-to_unique_ptr (T *ptr)
-{
-  return std::unique_ptr<T> (ptr);
-}
-
 struct test_offsets_ok
 {
   size_t in_size, out_size;
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
index 169504939..f7a0a4fd8 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
@@ -27,8 +27,8 @@ void
 test_utf8_utf32_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ == 4
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8<wchar_t> ());
-  test_utf8_utf32_codecvts (*cvt_ptr);
+  auto cvt = codecvt_utf8<wchar_t> ();
+  test_utf8_utf32_codecvts (cvt);
 #endif
 }
 
@@ -36,8 +36,8 @@ void
 test_utf8_utf16_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ >= 2
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8_utf16<wchar_t> ());
-  test_utf8_utf16_cvts (*cvt_ptr);
+  auto cvt = codecvt_utf8_utf16<wchar_t> ();
+  test_utf8_utf16_cvts (cvt);
 #endif
 }
 
@@ -45,8 +45,8 @@ void
 test_utf8_ucs2_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ == 2
-  auto cvt_ptr = to_unique_ptr (new codecvt_utf8<wchar_t> ());
-  test_utf8_ucs2_cvts (*cvt_ptr);
+  auto cvt = codecvt_utf8<wchar_t> ();
+  test_utf8_ucs2_cvts (cvt);
 #endif
 }
 
-- 
2.34.1




Re: [PATCH v2] libstdc++: Fix Unicode codecvt and add tests [PR86419]

2023-01-10 Thread Dimitrij Mijoski via Gcc-patches
On Tue, 2023-01-10 at 13:28 +0000, Jonathan Wakely wrote:
> Thanks for the patch. Do you have a copyright assignment for gcc
> filed with the FSF? 

Yes, I have already signed the copyright assignment.


[PATCH v2] libstdc++: Fix Unicode codecvt and add tests [PR86419]

2023-01-10 Thread Dimitrij Mijoski via Gcc-patches
Fixes the conversion from UTF-8 to UTF-16 to properly return partial
instead of ok.
Fixes the conversion from UTF-16 to UTF-8 to properly return partial
instead of ok.
Fixes the conversion from UTF-8 to UCS-2 to properly return partial
instead of error.
Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences
as an error just by seeing the leading byte.
Fixes UTF-8 decoding for all codecvts so they detect an error at the end
of the input range even when the last code point is also incomplete.
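
The last point can be sketched as follows. This is a simplified,
standalone illustration rather than the code in codecvt.cc; the names
decode and classify_3byte are hypothetical, and the caller is assumed to
pass avail >= 1 with a lead byte in [0xE0, 0xEF]. The idea is to ask
only for the bytes needed for the next check, so that an invalid second
byte is reported as an error even when the third byte is missing.

```cpp
#include <cstddef>

enum class decode { ok, incomplete, invalid };

// Hypothetical sketch: classify a 3-byte UTF-8 sequence.
decode classify_3byte(const unsigned char* from, std::size_t avail)
{
  if (avail < 2)
    return decode::incomplete;
  const unsigned char c1 = from[0];
  const unsigned char c2 = from[1];
  if ((c2 & 0xC0) != 0x80)
    return decode::invalid;     // not a continuation byte
  if (c1 == 0xE0 && c2 < 0xA0)
    return decode::invalid;     // overlong encoding
  if (avail < 3)
    return decode::incomplete;  // first two bytes are fine, need one more
  if ((from[2] & 0xC0) != 0x80)
    return decode::invalid;
  return decode::ok;
}
```

With a single up-front "avail < 3" check, the two bytes E0 41 would be
classified as incomplete; with the staged checks they are invalid.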

libstdc++-v3/ChangeLog:
PR libstdc++/86419
* src/c++11/codecvt.cc: Fix bugs.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: New tests.
* testsuite/22_locale/codecvt/codecvt_unicode.h: New tests.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: New
  tests.
---
 libstdc++-v3/src/c++11/codecvt.cc |   38 +-
 .../22_locale/codecvt/codecvt_unicode.cc  |   68 +
 .../22_locale/codecvt/codecvt_unicode.h   | 1268 +
 .../codecvt/codecvt_unicode_wchar_t.cc|   59 +
 4 files changed, 1414 insertions(+), 19 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 9f8cb7677..49282a510 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -277,13 +277,15 @@ namespace
 }
 else if (c1 < 0xF0) // 3-byte sequence
 {
-  if (avail < 3)
+  if (avail < 2)
return incomplete_mb_character;
   char32_t c2 = (unsigned char) from[1];
   if ((c2 & 0xC0) != 0x80)
return invalid_mb_sequence;
   if (c1 == 0xE0 && c2 < 0xA0) // overlong
return invalid_mb_sequence;
+  if (avail < 3)
+   return incomplete_mb_character;
   char32_t c3 = (unsigned char) from[2];
   if ((c3 & 0xC0) != 0x80)
return invalid_mb_sequence;
@@ -292,9 +294,9 @@ namespace
from += 3;
   return c;
 }
-else if (c1 < 0xF5) // 4-byte sequence
+else if (c1 < 0xF5 && maxcode > 0xFFFF) // 4-byte sequence
 {
-  if (avail < 4)
+  if (avail < 2)
return incomplete_mb_character;
   char32_t c2 = (unsigned char) from[1];
   if ((c2 & 0xC0) != 0x80)
@@ -302,10 +304,14 @@ namespace
   if (c1 == 0xF0 && c2 < 0x90) // overlong
return invalid_mb_sequence;
   if (c1 == 0xF4 && c2 >= 0x90) // > U+10FFFF
-  return invalid_mb_sequence;
+   return invalid_mb_sequence;
+  if (avail < 3)
+   return incomplete_mb_character;
   char32_t c3 = (unsigned char) from[2];
   if ((c3 & 0xC0) != 0x80)
return invalid_mb_sequence;
+  if (avail < 4)
+   return incomplete_mb_character;
   char32_t c4 = (unsigned char) from[3];
   if ((c4 & 0xC0) != 0x80)
return invalid_mb_sequence;
@@ -527,12 +533,11 @@ namespace
   // Flag indicating whether to process UTF-16 or UCS2
   enum class surrogates { allowed, disallowed };
 
-  // utf8 -> utf16 (or utf8 -> ucs2 if s == surrogates::disallowed)
-  template
-  codecvt_base::result
-  utf16_in(range& from, range& to,
-  unsigned long maxcode = max_code_point, codecvt_mode mode = {},
-  surrogates s = surrogates::allowed)
+  // utf8 -> utf16 (or utf8 -> ucs2 if maxcode <= 0xFFFF)
+  template 
+  codecvt_base::result utf16_in (range , range ,
+unsigned long maxcode = max_code_point,
+codecvt_mode mode = {})
   {
 read_utf8_bom(from, mode);
 while (from.size() && to.size())
@@ -540,12 +545,7 @@ namespace
auto orig = from;
const char32_t codepoint = read_utf8_code_point(from, maxcode);
if (codepoint == incomplete_mb_character)
- {
-   if (s == surrogates::allowed)
- return codecvt_base::partial;
-   else
- return codecvt_base::error; // No surrogates in UCS2
- }
+ return codecvt_base::partial;
if (codepoint > maxcode)
  return codecvt_base::error;
if (!write_utf16_code_point(to, codepoint, mode))
@@ -554,7 +554,7 @@ namespace
return codecvt_base::partial;
  }
   }
-return codecvt_base::ok;
+return from.size () ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // utf16 -> utf8 (or ucs2 -> utf8 if s == surrogates::disallowed)
@@ -576,7 +576,7 @@ namespace
  return codecvt_base::error; // No surrogates in UCS-2
 
if (from.size() < 2)
- return codecvt_base::ok; // stop converting at this point
+ return codecvt_base::partial; // stop converting at this point
 
const char32_t c2 = from[1];
if (is_low_surrogate(c2))
@@ -629,7 +629,7 @@ namespace
   {
 // UCS-2 

[PATCH] libstdc++: Fix Unicode codecvt and add tests [PR86419]

2020-09-24 Thread Dimitrij Mijoski via Gcc-patches
Fixes the conversion from UTF-8 to UTF-16 to properly return partial
instead of ok.
Fixes the conversion from UTF-16 to UTF-8 to properly return partial
instead of ok.
Fixes the conversion from UTF-8 to UCS-2 to properly return partial
instead of error.
Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences
as an error just by seeing the leading byte.
Fixes UTF-8 decoding for all codecvts so they detect an error at the end
of the input range even when the last code point is also incomplete.

The testsuite is large and may need splitting into multiple files.

libstdc++-v3/ChangeLog:
PR libstdc++/86419
* src/c++11/codecvt.cc: Fix bugs.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: New tests.
---
 libstdc++-v3/src/c++11/codecvt.cc |   25 +-
 .../22_locale/codecvt/codecvt_unicode.cc  | 1310 +
 2 files changed, 1323 insertions(+), 12 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 0311b15177d0..4545ba1b5933 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -277,13 +277,15 @@ namespace
 }
 else if (c1 < 0xF0) // 3-byte sequence
 {
-  if (avail < 3)
+  if (avail < 2)
return incomplete_mb_character;
   unsigned char c2 = from[1];
   if ((c2 & 0xC0) != 0x80)
return invalid_mb_sequence;
   if (c1 == 0xE0 && c2 < 0xA0) // overlong
return invalid_mb_sequence;
+  if (avail < 3)
+   return incomplete_mb_character;
   unsigned char c3 = from[2];
   if ((c3 & 0xC0) != 0x80)
return invalid_mb_sequence;
@@ -292,9 +294,9 @@ namespace
from += 3;
   return c;
 }
-else if (c1 < 0xF5) // 4-byte sequence
+else if (c1 < 0xF5 && maxcode > 0xFFFF) // 4-byte sequence
 {
-  if (avail < 4)
+  if (avail < 2)
return incomplete_mb_character;
   unsigned char c2 = from[1];
   if ((c2 & 0xC0) != 0x80)
@@ -302,10 +304,14 @@ namespace
   if (c1 == 0xF0 && c2 < 0x90) // overlong
return invalid_mb_sequence;
   if (c1 == 0xF4 && c2 >= 0x90) // > U+10FFFF
-  return invalid_mb_sequence;
+   return invalid_mb_sequence;
+  if (avail < 3)
+   return incomplete_mb_character;
   unsigned char c3 = from[2];
   if ((c3 & 0xC0) != 0x80)
return invalid_mb_sequence;
+  if (avail < 4)
+   return incomplete_mb_character;
   unsigned char c4 = from[3];
   if ((c4 & 0xC0) != 0x80)
return invalid_mb_sequence;
@@ -540,12 +546,7 @@ namespace
auto orig = from;
const char32_t codepoint = read_utf8_code_point(from, maxcode);
if (codepoint == incomplete_mb_character)
- {
-   if (s == surrogates::allowed)
- return codecvt_base::partial;
-   else
- return codecvt_base::error; // No surrogates in UCS2
- }
+ return codecvt_base::partial;
if (codepoint > maxcode)
  return codecvt_base::error;
if (!write_utf16_code_point(to, codepoint, mode))
@@ -554,7 +555,7 @@ namespace
return codecvt_base::partial;
  }
   }
-return codecvt_base::ok;
+return from.size() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // utf16 -> utf8 (or ucs2 -> utf8 if s == surrogates::disallowed)
@@ -576,7 +577,7 @@ namespace
  return codecvt_base::error; // No surrogates in UCS-2
 
if (from.size() < 2)
- return codecvt_base::ok; // stop converting at this point
+ return codecvt_base::partial; // stop converting at this point
 
const char32_t c2 = from[1];
if (is_low_surrogate(c2))
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
new file mode 100644
index 000000000000..88afd49206d1
--- /dev/null
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
@@ -0,0 +1,1310 @@
+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+
+#include 
+#include 
+#include 
+#include 
+
+using namespace std;
+
+template 

[PATCH] Improve contrib/clang-format to work with C++11 code [PR97076]

2020-09-17 Thread Dimitrij Mijoski via Gcc-patches
contrib/ChangeLog:
PR other/97076
* clang-format: Update.
---
 contrib/clang-format | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/clang-format b/contrib/clang-format
index 7a4e96f64ca..ceb5c1d524f 100644
--- a/contrib/clang-format
+++ b/contrib/clang-format
@@ -147,4 +147,4 @@ AlignTrailingComments: true
 AllowShortFunctionsOnASingleLine: All
 AlwaysBreakTemplateDeclarations: MultiLine
 KeepEmptyLinesAtTheStartOfBlocks: false
-Standard: Cpp03
+Standard: Auto
-- 
2.25.1