On 4/19/26 14:31, Paul Eggert wrote:
On 2026-04-19 11:21, [email protected] wrote:
It may not be exactly the same as this grep report, but it's related.

Yes, that sounds plausible. However, my guess is that the problem occurs only in older macOS versions, which use now-obsolete case-conversion tables.

Maybe we can fix Gnulib regex to work even on older macOS (as well as on OpenBSD), i.e., to treat dž as matching Dž when ignoring case even though macOS itself does not do so. Come to think of it, this might improve Gnulib regex performance on GNU/Linux. I'll add that to my long list of things to do.
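One way to picture that idea, as a minimal sketch only: compare two wide characters through both of their case mappings instead of relying on towupper alone. The helper name wc_equal_ignoring_case is hypothetical and this is not Gnulib code. It only helps if towlower maps U+01C5 to U+01C6 on the affected system, which is an assumption here and not something shown in this thread.

```c
#include <wctype.h>

/*
 * Hypothetical helper, not taken from Gnulib: two wide characters are
 * treated as equal when ignoring case if they are identical, or if
 * either their lowercase or their uppercase mappings agree.
 * ASSUMPTION: the C library maps a titlecase character correctly in
 * at least one direction.
 */
static int
wc_equal_ignoring_case( wint_t a, wint_t b )
{
    if ( a == b ) {
        return 1;
    }
    if ( towlower(a) == towlower(b) ) {
        return 1;
    }
    if ( towupper(a) == towupper(b) ) {
        return 1;
    }
    return 0;
}
```

With a helper like this a matcher does not depend on a single mapping direction being right, at the cost of up to four library calls per comparison.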


Just for the record and with a tad more verbosity and locale options :

/*
   See GNU GREP bug report :
   https://lists.gnu.org/archive/html/bug-grep/2026-04/msg00027.html

   This appears to be an incompatibility in OpenBSD, which mishandles
   titlecase characters. In an en_US.UTF-8 locale OpenBSD's towupper
   function treats the character "Dž" (U+01C5 LATIN CAPITAL LETTER D WITH
   SMALL LETTER Z WITH CARON) differently from GNU/Linux. If you call
   towupper (0x01C5) on OpenBSD it returns 0x01C5, that is, it acts as if
   this character is uppercase. However, it's titlecase, not uppercase. It
   should uppercase to "DŽ", i.e., to U+01C4 LATIN CAPITAL LETTER DZ WITH
   CARON.

   Small test code should output "towupper (0x01C5) = 0x01C4" and on
   OpenBSD it may output "towupper (0x01C5) = 0x01C".

*/

/*
 * This code should be C90 clean and therefore we may use :
 *
 *                 #define _XOPEN_SOURCE 500
 *
 * NOTE: for reasons yet unknown OpenBSD 7.8 has a fit if you try
 *       to define _XOPEN_SOURCE and be damned if I know why.
 */

#if ! defined(__OpenBSD__)
#if ! defined (_XOPEN_SOURCE)
#define _XOPEN_SOURCE 500
#endif
#endif

#include <locale.h>
#include <wctype.h>
#include <stdio.h>
#include <stdlib.h>

int
main( int argc, char **argv )
{
    /* NOTE : wint_t is declared by wctype.h as well as by wchar.h */
    wint_t c, w;
    char *buf;

    /* assume a trivial POSIX locale */
    buf = setlocale( LC_ALL, "POSIX" );
    if ( buf == NULL ) {
        fprintf (stderr,"FAIL : setlocale fail\n");
        return EXIT_FAILURE;
    }

    if ( argc > 1 ) {
        printf("\nINFO : You suggest a locale of %s\n", argv[1]);
        buf = setlocale( LC_ALL, argv[1] );
        /* The return value is NULL if the request can not be done */
        if ( buf == NULL ) {
            fprintf(stderr,"FAIL : * * * locale request failed * * *\n");
            fprintf(stderr,"     : ---------------------------------\n");
            fprintf(stderr,"     : please check your available list\n");
            fprintf(stderr,"     : of supported locales:\n");
            fprintf(stderr,"     : use \"locale -a\".\n");
            return EXIT_FAILURE;
        }
        printf("     : accepted.\n");
    } else {
        printf("\nINFO : locale is set to default \"POSIX\".\n\n");
    }

    c = 0x01C5;
    w = towupper(c);

    printf("\n(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)\n\n");

    /* %X expects an unsigned int argument */
    printf("towupper (0x01C5) = 0x%04X\n", (unsigned int) w);

    printf("\n\nShould output \"towupper (0x01C5) = 0x01C4\".\n");
    printf("Buggy stuff may output \"towupper (0x01C5) = 0x01C\"\n");
    printf("OpenBSD 7.8 is even more strange and reports 0x01C5\n\n");

    return EXIT_SUCCESS;

}

Are we sure about this test code ?

Everywhere I look the test fails in at least one locale, on OpenBSD and FreeBSD and even Solaris.


1) OpenBSD 7.8 AMD64 : ( -std=iso9899:1990 -pedantic -pedantic-errors )

eris$ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

eris$
eris$ ./test_wchar de_DE.UTF-8

INFO : You suggest a locale of de_DE.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

eris$

It really does not matter what locale I try. Always the same.
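To look at more than one mapping at a time, the test could be extended along the following lines. This is only a sketch: the helper name probe_case is hypothetical, and it simply prints whatever towupper, towlower, iswupper and iswlower report for one code point under the requested locale.

```c
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/*
 * Hypothetical helper: switch to the named locale and print the case
 * mappings and case classes that the C library reports for c.
 * Returns 0 on success, -1 if setlocale refuses the locale name.
 */
static int
probe_case( const char *locale_name, wint_t c )
{
    wint_t up, lo;

    if ( setlocale( LC_ALL, locale_name ) == NULL ) {
        fprintf(stderr, "FAIL : locale %s refused\n", locale_name);
        return -1;
    }

    up = towupper(c);
    lo = towlower(c);

    printf("%-12s 0x%04lX : towupper 0x%04lX towlower 0x%04lX"
           " iswupper %d iswlower %d\n",
           locale_name,
           (unsigned long) c, (unsigned long) up, (unsigned long) lo,
           iswupper(c) != 0, iswlower(c) != 0);

    return 0;
}
```

Calling it for 0x01C4, 0x01C5 and 0x01C6 under each locale of interest would show whether only the titlecase character is affected or the whole group.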

2) FreeBSD 15.0 with all the latest patches also shows strange output :

hydra$ uname -a
FreeBSD hydra 15.0-RELEASE-p5 FreeBSD 15.0-RELEASE-p5 GENERIC amd64
hydra$
hydra$ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

hydra$

hydra$ ./test_wchar en_US.UTF-8

INFO : You suggest a locale of en_US.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C4


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

hydra$

3) FreeBSD 16.0-CURRENT built from sources

callisto$ uname -a
FreeBSD callisto 16.0-CURRENT FreeBSD 16.0-CURRENT main-n285053-3e27114a7f96 GENERIC amd64
callisto$ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

callisto$
callisto$ ./test_wchar en_US.UTF-8

INFO : You suggest a locale of en_US.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C4


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

callisto$

           * * * T H I S    I S    I N T E R E S T I N G * * *

4) Solaris 11.4 latest edition on ORACLE SPARC S7-2 with the Oracle
   Studio compiler tools works with locale en_US.UTF-8

neptune $ uname -a
SunOS neptune 5.11 11.4.90.212.0 sun4v sparc sun4v non-virtualized
neptune $ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

neptune $ ./test_wchar en_US.UTF-8

INFO : You suggest a locale of en_US.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C4


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

neptune $

So that works on Solaris 11.4 with the locale en_US.UTF-8 and also :

neptune $
neptune $ ./test_wchar de_DE.UTF-8

INFO : You suggest a locale of de_DE.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C4


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

neptune $

5) Solaris 10 reasonably patched and with ORACLE Studio 12.6

$
$ uname -a
SunOS hubble 5.10 Generic_150400-67 sun4v sparc sun4v
$ $CC -V
cc: Studio 12.6 Sun C 5.15 SunOS_sparc 2017/05/30
$
$ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

$ ./test_wchar en_US.UTF-8

INFO : You suggest a locale of en_US.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

$ ./test_wchar ja_JP.UTF-8

INFO : You suggest a locale of ja_JP.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

$

6) Red Hat Enterprise Linux 10 on SiFive RISC-V P550

rhel10_rv5$ uname -a
Linux sedna.bw.genunix.com 6.12.0-89.rv.0.el10.riscv64 #1 SMP PREEMPT_DYNAMIC Wed May 28 20:21:05 UTC 2025 riscv64 GNU/Linux
rhel10_rv5$
rhel10_rv5$ cat /etc/redhat-release
Red Hat Enterprise Linux release 10.0 (Coughlan)
rhel10_rv5$

rhel10_rv5$ $CC --version
gcc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

rhel10_rv5$ echo $CFLAGS
-std=iso9899:1990 -pedantic -pedantic-errors -g -O0 -fno-builtin -fno-unsafe-math-optimizations -march=rv64imafdc -mabi=lp64d
rhel10_rv5$ echo $CPPFLAGS
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_XOPEN_SOURCE=500
rhel10_rv5$

rhel10_rv5$ $CC $CFLAGS $CPPFLAGS -o test_wchar test_wchar.c
rhel10_rv5$
rhel10_rv5$ ./test_wchar

INFO : locale is set to default "POSIX".


(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C5


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

rhel10_rv5$
rhel10_rv5$ ./test_wchar en_US.UTF-8

INFO : You suggest a locale of en_US.UTF-8
     : accepted.

(U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)

towupper (0x01C5) = 0x01C4


Should output "towupper (0x01C5) = 0x01C4".
Buggy stuff may output "towupper (0x01C5) = 0x01C"
OpenBSD 7.8 is even more strange and reports 0x01C5

rhel10_rv5$


So I am just looking at various systems with their native libc or GNU libc,
but not musl yet. I may ponder that. The results are all over the place.
Not sure if any of this helps.
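Every run above in the default POSIX locale printed 0x01C5, so it may be worth having the test report whether the active locale is a UTF-8 one before judging the result. A minimal sketch, assuming the POSIX nl_langinfo interface is available (it is not part of C90) and assuming the codeset is spelled exactly "UTF-8"; the helper name codeset_is_utf8 is hypothetical:

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the codeset of the current locale
 * is reported as "UTF-8", otherwise 0.
 * ASSUMPTION: the platform spells the codeset name exactly "UTF-8";
 * other spellings would need to be added here.
 */
static int
codeset_is_utf8( void )
{
    const char *cs = nl_langinfo( CODESET );

    if ( cs == NULL ) {
        return 0;
    }
    return ( strcmp( cs, "UTF-8" ) == 0 );
}
```

The test program could call this after setlocale and print a warning when it returns 0, so that a POSIX locale run is not mistaken for a UTF-8 one.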


--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken


