Re: LC_CTYPE=UTF-8

shwaresyst Thu, 25 Jun 2020 19:11:53 -0700

There are plans for this: a POSIX.UTF-8 locale as an XSI base requirement.
There may be POSIX.UTF-E and UTF-I locales too, with the same features and
simply different charmaps. As options there may even be POSIX.ISO-7 and
POSIX.ISO-8 specifications as well, although this is unlikely, as no platform
I'm aware of fully supports ISO-6429 now. Because C11 and C17 are
fundamentally broken, with only a minimal partial fix slated for C2x, there
are no viable plans for a C.UTF-8 or C.UTF-E proposal that I've ever seen.

However, the way the standard is written now, only the repertoire that
transforms to a single-byte encoding may be used, and that is what the C2x
fix limits itself to. This is effectively normative support only of ASCII-68,
not ISO-646 or 10646. Expanding support to include some of the 2-byte graphic
repertoire is already permitted by POSIX, but not required. Making allowances
for most of the UCS-2 repertoire, including its 3-byte UTF-8 representations,
is fairly easy, but the text for this, and the significant changes for the
4-byte form needed for full UCS-4 and UTF-16 support, are still to be
proposed.
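
As a rough illustration of the byte-length tiers discussed here (this is not
text from the standard or from any proposal), the Python sketch below maps a
code point to the number of bytes its UTF-8 encoding needs. The helper name
utf8_length is hypothetical; the range boundaries are those of the UTF-8
encoding form itself.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs to encode one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")
    if code_point <= 0x7F:      # ASCII range: single byte
        return 1
    if code_point <= 0x7FF:     # 2-byte form
        return 2
    if code_point <= 0xFFFF:    # rest of the UCS-2 range: 3-byte form
        return 3
    return 4                    # beyond UCS-2: 4-byte form


# Cross-check the table against Python's own encoder for one sample per tier.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

The four samples land in the 1-, 2-, 3-, and 4-byte tiers respectively.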

The point is that it is still too early, in my opinion, to say what additional
capabilities these locales will provide to applications to ease multi-lingual
portability. Of the four choices I see the second or third as the minimum
desirable. The industry as a whole needs to communicate how much of Unicode
it wants supported in Issue 8, or it will be stuck with the minimum
represented by ASCII-68. Whatever is decided upon, expect bug fixes and
breaking changes to non-portable aspects of existing implementations so that
they conform to the final formal specification of the locale.

On Thursday, June 25, 2020 Ingo Schwarze <schwa...@usta.de> wrote:
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go into that direction,
i would make sure that both OpenBSD and the software i publish use
that name for a locale with these properties and consistently
recommend using that name.  Both already support a locale with these
properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8,
but so far, they recommend that users specify en_US.UTF-8 (for
historical reasons), which is a bit unfortunate because it looks
like requesting cultural conventions for a particular country, which
is not the intention.
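
For illustration only, here is a minimal Python sketch of how an application
might ask for such a locale by name and fall back to en_US.UTF-8 when the
neutral names are unavailable. The function name select_utf8_ctype and the
preference order are assumptions made for this example; which of these names
a given system accepts is implementation-dependent.

```python
import locale

# Candidate names in order of preference. Whether any of them exists is
# system-dependent (an assumption of this sketch, not a POSIX guarantee).
CANDIDATES = ("C.UTF-8", "POSIX.UTF-8", "en_US.UTF-8")


def select_utf8_ctype(candidates=CANDIDATES):
    """Set LC_CTYPE to the first candidate the system accepts.

    Returns the accepted name, or None if every candidate was rejected.
    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue  # this name is not supported here; try the next one
        return name
    return None


chosen = select_utf8_ctype()
print("LC_CTYPE locale:", chosen if chosen else "no UTF-8 locale found")
```

Only LC_CTYPE is changed, so the other categories keep whatever cultural
conventions were already in effect.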

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
as synonyms looks a bit like asking for the best colour of a bikeshed.
Given that the standard already contains the redundancy of requiring
both "C" and "POSIX", maybe it is more consistent to also require
both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
greatly.

Yours,
  Ingo
