Eli Zaretskii <[email protected]> wrote:

> > From: [email protected]
> > Date: Mon, 20 Apr 2026 07:43:16 -0600
> > Cc: [email protected], [email protected], [email protected],
> >  [email protected]
> >
> > If I understand it, wrapping things in braces helps, in that sorting
> > will then be done on all the bytes in a multibyte-encoded letter
> > instead of on just the first byte.
>
> OK, but that still leaves the question of whether the byte sequence
> corresponding to {à} sorts before or after {É}, say.  And Gawk
> determines that by calling locale-dependent libc functions, doesn't
> it?  Or did you assume LC_ALL=C, which will cause Gawk work with
> individual bytes?  (And if so, what do other Awks do in that case?)
Hmmm... First, let's restrict this to gawk. Most other awks don't
deal in wide characters.

Gawk only converts to wide strings internally when needed, like for
length(), index(), and so on. It looks like, if not ignoring case,
for == and !=, strcmp() is used on the multibyte strings. Otherwise
(<, <= etc), memcmp().

> IOW, why did you say strncmp and not wcscmp?

wcscmp() isn't used at all in gawk. This may be a bug, but maybe not.
It's largely irrelevant for texindex, which has to be portable. There's
no perfect solution.
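To make the {à} versus {É} question concrete, here is a small sketch of what a purely byte-wise comparison (the kind memcmp() performs) does with UTF-8 input. It is written in Python only for brevity and is not gawk's code; the helper names compare_bytes and compare_first_byte are made up for this illustration, and UTF-8 encoding is assumed.

```python
# Illustration only: byte-wise ordering of UTF-8 strings, not gawk's code.
# Assumes UTF-8; compare_bytes and compare_first_byte are hypothetical helpers.

A_GRAVE = "\u00e0"      # à, UTF-8 bytes C3 A0
E_ACUTE_CAP = "\u00c9"  # É, UTF-8 bytes C3 89


def compare_bytes(a: str, b: str) -> int:
    """Return -1, 0 or 1 from comparing the full UTF-8 byte sequences."""
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)


def compare_first_byte(a: str, b: str) -> int:
    """Return -1, 0 or 1 from comparing only the first UTF-8 byte."""
    ab, bb = a.encode("utf-8")[:1], b.encode("utf-8")[:1]
    return (ab > bb) - (ab < bb)


# Both letters start with the lead byte 0xC3, so the first byte alone
# cannot tell them apart; the full sequence puts É (0x89) before à (0xA0).
first_byte_result = compare_first_byte(A_GRAVE, E_ACUTE_CAP)
full_bytes_result = compare_bytes(A_GRAVE, E_ACUTE_CAP)

words = ["\u00e9t\u00e9", "\u00e0pre", "\u00c9ole", "zoo", "abc"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
```

Here the first-byte comparison is a tie, and the full byte comparison sorts É before à, with every ASCII word ahead of both. That ordering is stable across systems, but it is not what a locale-aware collation would normally produce.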
