Re: Thoughts on the wcwidth confusion

Jakob Bohm via Cygwin Mon, 01 Jun 2026 11:01:16 -0700

On 01/06/2026 18:12, Thomas Wolff via Cygwin wrote:

On 01.06.2026 at 17:59, Thomas Wolff via Cygwin wrote:
On 01.06.2026 at 17:34, Jakob Bohm via Cygwin wrote:
Dear list,

Having read through the recent debate around the wcwidth() POSIX API,
wchar_t definitions, gcc-16 and cygwin, I have an idea not
mentioned in the list so far:

Using C17 types char32_t and char16_t, the situation can be
summarized as follows:

- Many, but not all POSIX systems define wchar_t as char32_t and thus
wint_t as uint_least32_t

- Win32 and thus Cygwin defines wchar_t as char16_t and thus wint_t as
uint_least16_t
- All systems considered treat wchar_t as Unicode, with Win32 supporting
 UTF-16 since NT 5.00 (Windows 2000).

- For char16_t/UTF-16, wcwidth() should use the high surrogate to
 determine the range of Unicode symbols and return a width common to
 that range, then return 0 for the low surrogates, thereby allowing
 computation of string width without having to first assemble
 surrogates into full char32_t values. Deciding if char32_t
 implementations should still lump groups of 4 Unicode rows for UTF-16
 compatibility is up to each implementation.
It's a neat idea to split the width calculation over the surrogates.
Unfortunately it does not work this way because widthness does not
change in full 1024-byte blocks. For example, U+1F4FC is Wide,
U+1F4FD and U+1F4FE are narrow/Neutral (N), and U+1F4FF is W again.
As a variant of your idea, wcwidth could return width 1 for every
high surrogate, remember it, and if the subsequent invocation is a
low surrogate, determine the combined width and return either 1 or 0.
Not quite standard behaviour, I suspect, so maybe not a good idea for
the purists, but maybe worth some discussion.

On the other hand, there are also combining characters in the non-BMP,
so the only way this could work is width 0 for high surrogates, then
sum up to the actual width on the low surrogate. That leaves the
question of how to handle an (erroneously) single high surrogate...
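
For illustration, the "width 0 on the high surrogate, real width on the
low surrogate" variant could look roughly like the sketch below. The
names stateful_c16width() and toy_cp_width() are made up for this
example, the width table knows only the four code points quoted above,
and the state is passed explicitly instead of being hidden inside the
function. Returning -1 for an out-of-sync low surrogate and silently
dropping a stale high surrogate are assumptions, not settled behaviour.

```c
#include <stdint.h>
#include <uchar.h>

/* Hypothetical stand-in for a real East Asian Width lookup.  It knows only
   the four code points quoted above and treats everything else as narrow. */
static int toy_cp_width(uint_least32_t cp)
{
    if (cp == 0x1F4FC || cp == 0x1F4FF)
        return 2;                       /* Wide (W) */
    return 1;                           /* includes U+1F4FD and U+1F4FE */
}

/* Hypothetical stateful variant: 0 for a high surrogate, the combined
   width on the following low surrogate, -1 for a low surrogate that has
   no pending high surrogate.  *pending_high must start out as 0. */
static int stateful_c16width(char16_t c, char16_t *pending_high)
{
    if (c >= 0xD800 && c <= 0xDBFF) {   /* high surrogate: remember it */
        *pending_high = c;
        return 0;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {   /* low surrogate */
        if (*pending_high == 0)
            return -1;                  /* out of sync */
        uint_least32_t cp = 0x10000
            + ((uint_least32_t)(*pending_high - 0xD800) << 10)
            + (uint_least32_t)(c - 0xDC00);
        *pending_high = 0;
        return toy_cp_width(cp);
    }
    *pending_high = 0;                  /* BMP unit: drop any stale high */
    return toy_cp_width(c);
}
```

Feeding it the UTF-16 pair D83D DCFC (U+1F4FC) yields 0 and then 2, while
D83D DCFD (U+1F4FD) yields 0 and then 1, which is exactly the per-code-point
distinction a per-high-surrogate width cannot express.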

If using this "hidden state" concept, the big question is how to handle
a single or out-of-sync low surrogate in wcwidth().  For wcswidth(),
the full context is always available and lone surrogates will be no
different than other invalid chars such as U+1FFFFE.
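
As a rough sketch of the wcswidth() case, where the whole string is
visible: the function below pairs surrogates as it walks the buffer and
reports any unpaired one as invalid. sketch_c16swidth() and
toy_c32width() are hypothetical names, the width table is a placeholder
that knows only U+1F4FC and U+1F4FF as wide, and returning -1 for a
lone surrogate is an assumption modelled on how wcswidth() reports
non-printable characters.

```c
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>

/* Hypothetical placeholder for a real width table. */
static int toy_c32width(char32_t cp)
{
    return (cp == 0x1F4FC || cp == 0x1F4FF) ? 2 : 1;
}

/* Sketch: width of at most n UTF-16 code units, stopping at a NUL.
   Returns -1 if a surrogate is unpaired, including a pair cut by n. */
static int sketch_c16swidth(const char16_t *s, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n && s[i] != 0; i++) {
        char32_t cp = s[i];
        if (cp >= 0xDC00 && cp <= 0xDFFF)
            return -1;              /* low surrogate without a high one */
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return -1;          /* high surrogate without a low one */
            cp = 0x10000 + ((cp - 0xD800) << 10)
                 + (char32_t)(s[i + 1] - 0xDC00);
            i++;                    /* consume the low surrogate too */
        }
        total += toy_c32width(cp);
    }
    return total;
}
```

No state survives between calls here, so the out-of-sync question only
arises for the single-character function.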

A practical solution would be for Cygwin/newlib to provide new functions
c16width(), c32width(), c16swidth() and c32swidth(), each being the
explicit-size equivalent of the similarly named wc and wcs function.

Then wcwidth() can be a trivial inline alias of the explicit-size
equivalent for the compile target, with the newlib header checking a
compiler or standard define that indicates the chosen size of wchar_t.

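// possible wchar.h snippet
//
// C17+ required
// For C2Y+ this should go in uchar.h
//
#include <stddef.h>  // size_t
#include <uchar.h>   // char16_t, char32_t

int c16width(char16_t c);
int c32width(char32_t c);
int c16swidth(const char16_t *s, size_t n);
int c32swidth(const char32_t *s, size_t n);

// ...

// This belongs in wchar.h for C1x- compat
//
// "static inline" gives every translation unit its own copy, so no
// separate external definition is needed.  The pointer casts are there
// because wchar_t need not be the same underlying type as char16_t or
// char32_t, even when the sizes match.
#if SOMETHING_MEANING_16bit_WCHAR_T
static inline int wcwidth(wchar_t c) {
  return c16width((char16_t)c);
}
static inline int wcswidth(const wchar_t *s, size_t n)
{
  return c16swidth((const char16_t *)s, n);
}
#else
static inline int wcwidth(wchar_t c) {
  return c32width((char32_t)c);
}
static inline int wcswidth(const wchar_t *s, size_t n)
{
  return c32swidth((const char32_t *)s, n);
}
#endif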
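
One candidate for the SOMETHING_MEANING_16bit_WCHAR_T placeholder is the
standard WCHAR_MAX macro from <stdint.h> and <wchar.h>, which is usable
in #if. The sketch below is only a guess at how such a check might look;
SKETCH_WCHAR_IS_16BIT and sketch_wchar_is_16bit() are hypothetical
names, and a narrow wchar_t does not by itself prove that the platform
uses UTF-16.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>    /* wchar_t */

/* Compile-time check: a wchar_t whose maximum fits in 16 bits cannot
   hold every Unicode code point in a single unit. */
#if WCHAR_MAX <= 0xFFFF
#define SKETCH_WCHAR_IS_16BIT 1
#else
#define SKETCH_WCHAR_IS_16BIT 0
#endif

/* Run-time cross-check of the same property, based on the object size. */
static int sketch_wchar_is_16bit(void)
{
    return sizeof(wchar_t) * CHAR_BIT <= 16;
}
```

On a target where wchar_t is 16 bits wide the macro expands to 1, and on
one where it is 32 bits wide it expands to 0.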


Enjoy

Jakob

--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded


--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
