Hi Ingo,

On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> In general, the tool for checking the validity of UTF-8 strings
> is a simple loop around mblen(3) if you want to report the precise
> positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> argument if you are content with a one-bit "valid" or "invalid" answer.
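
For reference, the mblen(3) loop described above might look roughly
like the sketch below. The helper name first_invalid() is made up for
illustration, and the caller is assumed to have already selected a
UTF-8 locale with setlocale(3).

```c
#include <assert.h>
#include <locale.h>	/* setlocale(), needed by the caller */
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any library): return the byte
 * offset of the first invalid or incomplete multibyte character in s,
 * or -1 if every character is valid in the current LC_CTYPE locale.
 */
static long
first_invalid(const char *s)
{
	size_t len, pos = 0;
	int n;

	assert(s != NULL);
	len = strlen(s);

	/* Reset the internal conversion state before scanning. */
	(void)mblen(NULL, 0);

	while (pos < len) {
		n = mblen(s + pos, len - pos);
		if (n <= 0)		/* -1: invalid or incomplete */
			return (long)pos;
		pos += (size_t)n;	/* skip one whole character */
	}
	return -1;
}
```

After setlocale(LC_CTYPE, "en_US.UTF-8") succeeds, I would expect
first_invalid("\xe1rbol") to report offset 0, because a lone 0xE1 byte
followed by 'r' is not a valid UTF-8 sequence, and
first_invalid("\xc3\xa1rbol") to report -1.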

According to mbstowcs(3):

------------------------------------------------------------------------
RETURN VALUES
     mbstowcs() returns:

     0 or positive
             The value returned is the number of elements stored in the
             array pointed to by pwcs, except for a terminating null wide
             character (if any).  If pwcs is not null and the value
             returned is equal to n, the wide-character string pointed to
             by pwcs is not null terminated.  If pwcs is a null pointer,
             the value returned is the number of elements to contain the
             whole string converted, except for a terminating null wide
             character.

     (size_t)-1
             The array indirectly pointed to by s contains a byte sequence
             forming invalid character.  In this case, mbstowcs() sets
             errno to indicate the error.

ERRORS
     mbstowcs() may cause an error in the following cases:

     [EILSEQ]  s points to the string containing invalid or incomplete
               multibyte character.
------------------------------------------------------------------------

To understand what mbstowcs(3) does, I wrote the little test.c program
pasted at the bottom.

In the following examples, [a] is a UTF-8 aacute and (a) is an
ISO Latin aacute.

Using setlocale(LC_CTYPE, "en_US.UTF-8");

$ cc -g -Wall test.c
$ echo -n arbol | a.out
ulen: 5
$ echo -n [a]rbol | a.out
ulen: 5
$ echo -n (a)rbol | a.out
ulen: 5

Using setlocale(LC_CTYPE, "C");

$ cc -g -Wall test.c
$ echo -n arbol | a.out
ulen: 5
$ echo -n [a]rbol | a.out
ulen: 6
$ echo -n (a)rbol | a.out
ulen: 7

There is no error message in any case.

I don't understand how those return values let me know that the third
string is invalid UTF-8.

Am I doing something wrong?


test.c
========================================
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main(void)
{
	int c;
	size_t i, ulen;
	char s[100];

	/* Read stdin into s, leaving room for the terminating NUL. */
	i = 0;
	while (i < sizeof(s) - 1 && (c = getchar()) != EOF)
		s[i++] = c;
	s[i] = '\0';

	setlocale(LC_CTYPE, "en_US.UTF-8");
	//setlocale(LC_CTYPE, "C");

	/* With a NULL pwcs, mbstowcs() only counts the wide characters. */
	if ((ulen = mbstowcs(NULL, s, 0)) == (size_t)-1)
		perror("error");

	printf("ulen: %zu\n", ulen);

	return 0;
}


-- 
Walter