On 01/06/2026 18:12, Thomas Wolff via Cygwin wrote:
On 01.06.2026 at 17:59, Thomas Wolff via Cygwin wrote:
On 01.06.2026 at 17:34, Jakob Bohm via Cygwin wrote:
Dear list,
Having read through the recent debate around the wcwidth() POSIX API,
wchar_t definitions, gcc-16 and cygwin, I have an idea not
mentioned in the list so far:
Using the C11 types char32_t and char16_t, the situation can be
summarized as follows:
- Many, but not all POSIX systems define wchar_t as char32_t and thus
wint_t as uint_least32_t
- Win32 and thus Cygwin defines wchar_t as char16_t and thus wint_t as
uint_least16_t
- All systems considered treat wchar_t as Unicode, with Win32 supporting
UTF-16 since NT 5.0 (Windows 2000).
- For char16_t/UTF-16, wcwidth() should use the high surrogate to
determine the range of Unicode symbols and return a width common to
that range, then return 0 for the low surrogates, thereby allowing
computation of string width without having to first assemble surrogates
into full char32_t values. Deciding if char32_t implementations should
still lump groups of 4 Unicode rows for UTF-16 compatibility is up to
each implementation.
It's a neat idea to split the width calculation over the surrogates.
Unfortunately it does not work this way because the width does not
change only at 1024-code-point block boundaries (the range covered by one
high surrogate). For example, U+1F4FC is Wide (W), U+1F4FD and U+1F4FE
are narrow/Neutral (N), and U+1F4FF is W again.
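To make the counterexample concrete, here is a minimal sketch that splits a supplementary code point into its UTF-16 surrogates. The helper name to_surrogates() is made up for illustration and is not part of newlib or any standard.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
 * (U+10000..U+10FFFF) into its UTF-16 high and low surrogates. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    uint32_t v = cp - 0x10000u;               /* 20-bit offset */
    *hi = (uint16_t)(0xD800u + (v >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00u + (v & 0x3FFu)); /* bottom 10 bits */
}
```

U+1F4FC through U+1F4FF all map to the same high surrogate 0xD83D (low surrogates 0xDCFC to 0xDCFF), so a width chosen from the high surrogate alone cannot tell the Wide ones from the Neutral ones.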
As a variant of your idea, wcwidth could return width 1 for every
high surrogate, remember it, and if the subsequent invocation is a
low surrogate, determine the combined width and return either 1 or 0.
Not quite standard behaviour, I suspect, so maybe not a good idea for
the purists, but maybe worth some discussion.
On the other hand, there are also combining characters outside the BMP,
so the only way this could work is width 0 for high surrogates, then
sum up to the actual width on the low surrogate. That leaves the question
of how to handle an (erroneously) lone high surrogate...
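A rough sketch of that variant, to have something concrete to argue about. Everything here is an assumption: c16width_stateful() and cp_width() are hypothetical names, the width table is a toy covering only the code points mentioned in this thread, and the choices for error cases (-1 for an unpaired low surrogate, silently dropping an unpaired high surrogate) are just one possible answer to the open question.

```c
#include <assert.h>
#include <stdint.h>
#include <uchar.h>

/* Toy stand-in for a real width table (assumption, not real data). */
static int cp_width(uint32_t cp)
{
    if (cp == 0x1F4FCu || cp == 0x1F4FFu) return 2; /* Wide */
    if (cp >= 0xE0100u && cp <= 0xE01EFu) return 0; /* variation selectors */
    return 1;
}

static char16_t pending_high; /* hidden state; 0 means "none pending" */

int c16width_stateful(char16_t c)
{
    if (c >= 0xD800 && c <= 0xDBFF) {   /* high surrogate: remember it */
        pending_high = c;
        return 0;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {   /* low surrogate */
        if (pending_high == 0)
            return -1;                  /* out-of-sync low surrogate */
        uint32_t cp = 0x10000u
            + (((uint32_t)(pending_high - 0xD800) << 10)
               | (uint32_t)(c - 0xDC00));
        pending_high = 0;
        assert(cp <= 0x10FFFFu);
        return cp_width(cp);            /* full width reported here */
    }
    pending_high = 0;                   /* a lone high surrogate is dropped */
    return cp_width(c);
}
```

The static variable makes this non-reentrant and not thread-safe, which is part of why the hidden-state approach is questionable.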
If using this "hidden state" concept, the big question is how to handle
a lone or out-of-sync low surrogate in wcwidth(). For wcswidth(),
the full context is always available and lone surrogates will be no
different from other invalid chars such as U+1FFFFE.
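For the string case, a minimal sketch of how the full context removes the need for hidden state. The name c16swidth_sketch() and the toy width table are assumptions for illustration; returning -1 for any unpaired surrogate mirrors how wcswidth() reports non-printable characters, but is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>

/* Toy stand-in for a real width table (assumption, not real data). */
static int toy_cp_width(uint32_t cp)
{
    if (cp == 0x1F4FCu || cp == 0x1F4FFu) return 2; /* Wide */
    if (cp >= 0xE0100u && cp <= 0xE01EFu) return 0; /* variation selectors */
    return 1;
}

/* Width of at most n UTF-16 code units of s, stopping at a terminator. */
int c16swidth_sketch(const char16_t *s, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n && s[i] != 0; i++) {
        uint32_t cp = s[i];
        if (cp >= 0xDC00u && cp <= 0xDFFFu)
            return -1;                  /* low surrogate without a high one */
        if (cp >= 0xD800u && cp <= 0xDBFFu) {
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return -1;              /* lone high surrogate */
            cp = 0x10000u + (((cp - 0xD800u) << 10)
                             | (uint32_t)(s[i + 1] - 0xDC00));
            i++;                        /* consume the low surrogate */
        }
        assert(cp <= 0x10FFFFu);
        total += toy_cp_width(cp);
    }
    return total;
}
```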
A practical solution would be for Cygwin/newlib to provide new functions
c16width(), c32width(), c16swidth() and c32swidth(), each being the
explicit-size equivalent of the similarly named wc and wcs function.
Then wcwidth() can be a trivial inline alias of the explicit-size
equivalent for the compile target, by having the newlib header check a
compiler or standard define indicating the chosen size of wchar_t.
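// possible wchar.h snippet
//
// C11+ required (char16_t and char32_t come from <uchar.h>)
// For C2Y+ this should go in uchar.h
//
#include <stddef.h>
#include <uchar.h>

int c16width(char16_t c);
int c32width(char32_t c);
int c16swidth(const char16_t *s, size_t n);
int c32swidth(const char32_t *s, size_t n);
// ...
// This belongs in wchar.h for C1x- compat
//
// static inline avoids needing a separate external definition.
// The pointer casts assume wchar_t has the same size and
// representation as the explicit-size type on this target.
#if SOMETHING_MEANING_16bit_WCHAR_T
static inline int wcwidth(wchar_t c)
{
    return c16width((char16_t)c);
}
static inline int wcswidth(const wchar_t *s, size_t n)
{
    return c16swidth((const char16_t *)s, n);
}
#else
static inline int wcwidth(wchar_t c)
{
    return c32width((char32_t)c);
}
static inline int wcswidth(const wchar_t *s, size_t n)
{
    return c32swidth((const char32_t *)s, n);
}
#endif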
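One possible way to fill in the size check, offered only as a sketch: WCHAR_MAX from <stdint.h>/<wchar.h> is usable in #if, so the header could test it directly. The macro name WCHAR_IS_16BIT and the helper wchar_is_16bit() are invented here, and the static_assert assumes 8-bit bytes.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stdint.h>  /* WCHAR_MAX */
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

/* Hypothetical macro: 1 if wchar_t holds at most 16 bits, else 0. */
#if WCHAR_MAX <= 0xFFFF
#define WCHAR_IS_16BIT 1
#else
#define WCHAR_IS_16BIT 0
#endif

/* Sanity check of the preprocessor test against the actual type size. */
static_assert(WCHAR_IS_16BIT == (sizeof(wchar_t) == 2),
              "WCHAR_MAX test disagrees with sizeof(wchar_t)");

/* Hypothetical helper so the result can be inspected at run time. */
int wchar_is_16bit(void)
{
    return WCHAR_IS_16BIT;
}
```

A compiler-specific define could serve the same purpose; WCHAR_MAX has the advantage of being standard C.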
Enjoy
Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded