In the thread "From wchar_t to char32_t" we discussed, among other things,
the mbrtoc32 function.
mbrtoc32, compared to mbrtowc, has two new features:
(a) it overcomes wchar_t limitations, especially the fact that on Windows,
wchar_t is only 16 bits wide.
(b) it allows a multibyte sequence to be mapped to a sequence of char32_t
characters, whereas mbrtowc maps a multibyte sequence to a single
wchar_t (or returns an error).
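As a rough illustration of feature (b), not part of the patch: a caller that wants to be correct for a general (non-regular) mbrtoc32 has to handle the (size_t) -3 return value, which delivers a further char32_t without consuming input. The helper name count_c32 is hypothetical, and the sketch assumes that a null character occupies exactly one byte.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: counts the char32_t units that mbrtoc32 produces for
   the N bytes at S.  Returns (size_t) -1 if the input contains an invalid
   or incomplete multibyte sequence.  */
size_t
count_c32 (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, '\0', sizeof state);
  size_t count = 0;

  /* Keep going while there is input left, or while the state may still
     hold pending char32_t units from the previous multibyte character.  */
  while (n > 0 || !mbsinit (&state))
    {
      char32_t c32;
      size_t ret = mbrtoc32 (&c32, s, n, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (ret == (size_t) -3)
        {
          /* Feature (b): another char32_t of the previous multibyte
             character was stored; no input bytes were consumed.  */
          count++;
          continue;
        }
      if (ret == 0)
        ret = 1;                /* assumption: the null character is 1 byte */
      s += ret;
      n -= ret;
      count++;
    }
  return count;
}
```

With a regular mbrtoc32, the (size_t) -3 branch and the mbsinit test in the loop condition are dead code, and the loop reduces to the familiar mbrtowc-style loop.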
With (a), we can satisfy
Goal (A): Support non-BMP characters (such as Emojis) better on Windows,
including Cygwin.
With (b), we could theoretically satisfy
Goal (B): Support locales with BIG5-HKSCS encoding better.
However, (B) is a NON-GOAL.
1) Hardly anyone uses the BIG5-HKSCS encoding.
2) As we have found out, through the diffutils exercise and the 'dfa'
module, supporting goal (B) means that
* Applications need to distinguish places where it's OK to handle
the several Unicode characters separately, such as in mbswidth,
from places where the multibyte character has to be kept as a unit,
and thus a wchar_t needs to be replaced not with a single char32_t
but with a sequence of char32_t.
* Accordingly, there is a need for two different 'mbchar' modules:
one that produces a single Unicode character at a time, and one
that produces a sequence of Unicode characters.
* Likewise for the modules 'mbiter' and 'mbuiter'.
This is basically the sort of complexity that we did NOT want to add
for supporting Windows with mbrtowc.
3) It's also a testability problem. Code that is not tested is buggy,
in general. So far, no glibc version implements mbrtoc32 correctly for
the BIG5-HKSCS encoding; see
<https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.
In order to test application code, we would have to write an alternate
mbrtoc32 function which, for example, maps the 'ä' character to
U+0061 U+0308.
But this would be even more complexity, for the sake of a hypothetical
scenario.
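To make the complexity argument in point 2) more concrete, here is a minimal sketch (not the actual Gnulib 'mbchar' code; struct mbchar_sketch and first_mbchar are hypothetical names) of a character abstraction that holds exactly one char32_t per multibyte character. It is only adequate when each multibyte character maps to a single Unicode character.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical one-character abstraction: a multibyte character together
   with the single char32_t it maps to.  */
struct mbchar_sketch
{
  const char *ptr;      /* start of the multibyte character */
  size_t bytes;         /* number of bytes it occupies, > 0 */
  bool valid;           /* true if C32 holds a decoded character */
  char32_t c32;         /* the decoded character, if VALID */
};

/* Decodes the first multibyte character among the N > 0 bytes at S.  */
struct mbchar_sketch
first_mbchar (const char *s, size_t n)
{
  struct mbchar_sketch mbc = { s, 1, false, 0 };
  mbstate_t state;
  memset (&state, '\0', sizeof state);

  char32_t c32;
  size_t ret = mbrtoc32 (&c32, s, n, &state);
  if (ret == (size_t) -1)
    return mbc;                 /* invalid: treat one byte as a unit */
  if (ret == (size_t) -2)
    {
      mbc.bytes = n;            /* incomplete: treat the rest as a unit */
      return mbc;
    }
  /* If mbrtoc32 mapped this multibyte character to several char32_t
     units, the remaining ones would stay pending in STATE and be lost
     here.  A second abstraction holding a sequence would be needed.  */
  mbc.bytes = (ret == 0 ? 1 : ret);
  mbc.valid = true;
  mbc.c32 = c32;
  return mbc;
}
```

Supporting goal (B) would mean replacing the single c32 field with a small array plus a length, and auditing every consumer of the structure.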
Paul seems to agree that this is a non-goal:
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00021.html
"We don't have time to support every oddball coding system that POSIX
allows."
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00026.html
"And since it'll likely be a hassle to port the rest of the code to
purely-theoretical platforms where nbytes == (size_t) -3, I suggest
instead simply adding a comment that nbytes cannot be (size_t) -3 there."
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00032.html
"you and I have already spent more time on theoretical platforms than
they're likely worth"
Adding a comment would be a possibility. But we can do better by formalizing
the notion that we do NOT want (b).
DEFINITION: We call an mbrtoc32 function _regular_ if
- It never returns (size_t)-3.
- When it returns < (size_t)-2, the mbstate_t is in the initial state.
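The definition can be written down as a predicate on a single call's outcome. This is only a sketch for illustration; the function names result_is_regular and decodes_regularly are hypothetical and do not appear in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical predicate: returns true if the return value RET of an
   mbrtoc32 call, together with the resulting state *PS, is consistent
   with the definition of a regular mbrtoc32.  */
bool
result_is_regular (size_t ret, const mbstate_t *ps)
{
  /* First condition: never (size_t) -3.  */
  if (ret == (size_t) -3)
    return false;
  /* Second condition: after a character was produced, the state is in
     the initial state.  */
  if (ret < (size_t) -2 && !mbsinit (ps))
    return false;
  return true;
}

/* Hypothetical convenience wrapper: decodes one multibyte character from
   the N bytes at S, starting in the initial state, and checks the outcome.  */
bool
decodes_regularly (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, '\0', sizeof state);
  char32_t c32;
  size_t ret = mbrtoc32 (&c32, s, n, &state);
  return result_is_regular (ret, &state);
}
```

Note that (size_t) -1 and (size_t) -2 are always accepted: the definition constrains only the calls that produce a character.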
Here I'm adding a Gnulib module that provides a _regular_ mbrtoc32 function,
together with a unit test. (Once the notion is formalized, it can be
verified by a unit test.)
2023-07-10 Bruno Haible <[email protected]>
mbrtoc32-regular: Add tests.
* tests/test-mbrtoc32-regular.c: New file.
* modules/mbrtoc32-regular-tests: New file.
mbrtoc32-regular: New module.
* modules/mbrtoc32-regular: New file.
* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
* doc/posix-functions/mbrtoc32.texi: Mention the new module.
From 0b55d1c3fbcb9bfa4b49a9aca16006294d118637 Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 11 Jul 2023 00:03:34 +0200
Subject: [PATCH 1/2] mbrtoc32-regular: New module.
* modules/mbrtoc32-regular: New file.
* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
* doc/posix-functions/mbrtoc32.texi: Mention the new module.
---
ChangeLog | 8 ++++++++
doc/posix-functions/mbrtoc32.texi | 24 +++++++++++++++---------
lib/mbrtoc32.c | 9 +++++++++
modules/mbrtoc32-regular | 27 +++++++++++++++++++++++++++
4 files changed, 59 insertions(+), 9 deletions(-)
create mode 100644 modules/mbrtoc32-regular
diff --git a/ChangeLog b/ChangeLog
index fdc8e42ad4..c8dc122aa4 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2023-07-10 Bruno Haible <[email protected]>
+
+ mbrtoc32-regular: New module.
+ * modules/mbrtoc32-regular: New file.
+ * lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
+ and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
+ * doc/posix-functions/mbrtoc32.texi: Mention the new module.
+
2023-07-10 Bruno Haible <[email protected]>
Apply the last change to all locale-*.m4 files.
diff --git a/doc/posix-functions/mbrtoc32.texi b/doc/posix-functions/mbrtoc32.texi
index 3528114bec..9690dd047d 100644
--- a/doc/posix-functions/mbrtoc32.texi
+++ b/doc/posix-functions/mbrtoc32.texi
@@ -2,9 +2,9 @@
@section @code{mbrtoc32}
@findex mbrtoc32
-Gnulib module: mbrtoc32
+Gnulib module: mbrtoc32 or mbrtoc32-regular
-Portability problems fixed by Gnulib:
+Portability problems fixed by either Gnulib module @code{mbrtoc32} or @code{mbrtoc32-regular}:
@itemize
@item
This function is missing on most non-glibc platforms:
@@ -35,19 +35,25 @@
@c See https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/mbrtoc16-mbrtoc323
@end itemize
-Portability problems not fixed by Gnulib:
+Portability problems fixed by Gnulib module @code{mbrtoc32-regular}:
@itemize
@item
+This function can map some multibyte characters to a sequence of two or more
+Unicode characters, and may thus return @code{(size_t) -3}.
+No known implementation currently (2023) behaves that way, but it may
+theoretically happen.
+With the @code{mbrtoc32-regular} module, you have the guarantee that the
+Gnulib-provided @code{mbrtoc32} function maps each multibyte character to
+exactly one Unicode character and thus never returns @code{(size_t) -3}.
+@item
This function behaves incorrectly when converting precomposed characters
from the BIG5-HKSCS encoding:
@c https://sourceware.org/bugzilla/show_bug.cgi?id=30611
glibc 2.36.
-@item
-Although ISO C says this function can return @code{(size_t) -3},
-no known implementation behaves that way,
-and if it were to happen it would break common uses.
-If dealing with @code{(size_t) -3} would complicate your code significantly,
-it is probably better not to bother.
+@end itemize
+
+Portability problems not fixed by Gnulib:
+@itemize
@item
This function is only defined as an inline function on some platforms:
Haiku 2020.
diff --git a/lib/mbrtoc32.c b/lib/mbrtoc32.c
index 6a56d93a4b..96039f9480 100644
--- a/lib/mbrtoc32.c
+++ b/lib/mbrtoc32.c
@@ -126,6 +126,15 @@ mbrtoc32 (char32_t *pwc, const char *s, size_t n, mbstate_t *ps)
size_t ret = mbrtoc32 (pwc, s, n, ps);
# endif
+# if GNULIB_MBRTOC32_REGULAR
+ /* Verify that mbrtoc32 is regular. */
+ if (ret < (size_t) -3 && ! mbsinit (ps))
+ /* This occurs on glibc 2.36. */
+ memset (ps, '\0', sizeof (mbstate_t));
+ if (ret == (size_t) -3)
+ abort ();
+# endif
+
# if MBRTOC32_IN_C_LOCALE_MAYBE_EILSEQ
if ((size_t) -2 <= ret && n != 0 && ! hard_locale (LC_CTYPE))
{
diff --git a/modules/mbrtoc32-regular b/modules/mbrtoc32-regular
new file mode 100644
index 0000000000..e8ae236fc5
--- /dev/null
+++ b/modules/mbrtoc32-regular
@@ -0,0 +1,27 @@
+Description:
+mbrtoc32() function that maps each multibyte character to exactly one Unicode
+character and thus never returns (size_t)(-3).
+
+Files:
+
+Depends-on:
+mbrtoc32
+
+configure.ac:
+gl_MODULE_INDICATOR([mbrtoc32-regular])
+
+Makefile.am:
+
+Include:
+<uchar.h>
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+Bruno Haible
--
2.34.1
From 2d46fcdd3fa38139f3c3b6cbc3439363553ee0e7 Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 11 Jul 2023 00:06:14 +0200
Subject: [PATCH 2/2] mbrtoc32-regular: Add tests.
* tests/test-mbrtoc32-regular.c: New file.
* modules/mbrtoc32-regular-tests: New file.
---
ChangeLog | 4 ++
modules/mbrtoc32-regular-tests | 14 ++++++
tests/test-mbrtoc32-regular.c | 79 ++++++++++++++++++++++++++++++++++
3 files changed, 97 insertions(+)
create mode 100644 modules/mbrtoc32-regular-tests
create mode 100644 tests/test-mbrtoc32-regular.c
diff --git a/ChangeLog b/ChangeLog
index c8dc122aa4..3eb2e2bc4b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
2023-07-10 Bruno Haible <[email protected]>
+ mbrtoc32-regular: Add tests.
+ * tests/test-mbrtoc32-regular.c: New file.
+ * modules/mbrtoc32-regular-tests: New file.
+
mbrtoc32-regular: New module.
* modules/mbrtoc32-regular: New file.
* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
diff --git a/modules/mbrtoc32-regular-tests b/modules/mbrtoc32-regular-tests
new file mode 100644
index 0000000000..907f73721a
--- /dev/null
+++ b/modules/mbrtoc32-regular-tests
@@ -0,0 +1,14 @@
+Files:
+tests/test-mbrtoc32-regular.c
+tests/macros.h
+
+Depends-on:
+mbsinit
+setlocale
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-mbrtoc32-regular
+check_PROGRAMS += test-mbrtoc32-regular
+test_mbrtoc32_regular_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/tests/test-mbrtoc32-regular.c b/tests/test-mbrtoc32-regular.c
new file mode 100644
index 0000000000..a85a0a5a69
--- /dev/null
+++ b/tests/test-mbrtoc32-regular.c
@@ -0,0 +1,79 @@
+/* Test of conversion of multibyte character to 32-bit wide character.
+ Copyright (C) 2023 Free Software Foundation, Inc.
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+/* Written by Bruno Haible <[email protected]>, 2023. */
+
+#include <config.h>
+
+#include <uchar.h>
+
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <uchar.h>
+#include <wchar.h>
+
+#include "macros.h"
+
+int
+main (int argc, char *argv[])
+{
+ /* The only locales in which mbrtoc32 may map a multibyte character to a
+ sequence of two or more Unicode characters are those with BIG5-HKSCS
+ encoding. See
+ <https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html>
+ <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00014.html> */
+ if (setlocale (LC_ALL, "zh_HK.BIG5-HKSCS") == NULL)
+ {
+ fprintf (stderr, "Skipping test: found no locale with BIG5-HKSCS encoding.\n");
+ return 77;
+ }
+
+ /* The problematic BIG5-HKSCS characters are:
+
+ input maps to name
+ ----- ------- ----
+ 0x88 0x62 U+00CA U+0304 LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON
+ 0x88 0x64 U+00CA U+030C LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON
+ 0x88 0xA3 U+00EA U+0304 LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON
+ 0x88 0xA5 U+00EA U+030C LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON
+
+ Test one of them.
+ See <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>. */
+ mbstate_t state;
+ memset (&state, '\0', sizeof (mbstate_t));
+ char32_t c32 = (char32_t) 0xBADFACE;
+ size_t ret = mbrtoc32 (&c32, "\210\142", 2, &state);
+ /* It is OK if this conversion fails. */
+ if (ret != (size_t)(-1))
+ {
+ /* mbrtoc32 being regular, means that STATE is in the initial state. */
+ ASSERT (mbsinit (&state));
+ ret = mbrtoc32 (&c32, "", 0, &state);
+ /* mbrtoc32 being regular, means that it returns (size_t)(-2), not
+ (size_t)(-3), here. */
+ ASSERT (ret == (size_t)(-2));
+ ret = mbrtoc32 (&c32, "", 1, &state);
+ /* mbrtoc32 being regular, means that it returns the null 32-bit wide
+ character, here, not any remnant from the previous multibyte
+ character. */
+ ASSERT (ret == 0);
+ ASSERT (c32 == 0);
+ }
+
+ return 0;
+}
--
2.34.1