Re: UTF-8 wakeup call

Keld Jørn Simonsen Sat, 07 Dec 2002 06:22:55 -0800

On Fri, Dec 06, 2002 at 11:39:43AM -0500, Henry Spencer wrote:
> On Fri, 6 Dec 2002, Keld Jørn Simonsen wrote:
> > Actually it is funny that you call it Unicode. UTF-8 clearly comes from
> > the 10646 side of UCS, Unicode did not invent it at all...
> 
> It did not come from 10646 either; it came from the *Unix* side of the
> house, specifically from X/Open.  And my understanding is that it was
> originally specifically an encoding for Unicode (although the distinction
> quickly became academic because of the conversion of 10646 into a Unicode
> clone).


UTF-8 then came through the 10646 side. Unicode was strictly 16-bit from
the outset. Pre-merger 10646 was 32-bit, with an 8-bit encoding made possible.
After the merger 10646 had an 8-bit encoding called UTF-1.

> Nobody except some standards zombies cared about encoding 10646, or indeed
> about any aspect of 10646; Unicode was the standard that the real world
> was clearly signing up for.  Which is why the 10646 committee, seeing the
> writing on the wall, abandoned its own efforts and aligned its standard
> with Unicode. 

Hmm, what is implemented in Linux is really not Unicode, but 10646, and
was that from the outset, e.g. the 32-bit wchar_t. But then you may call Linux
people standards zombies, that is your call, and yes, we use standards like
POSIX and ISO C etc. Anyway, I was writing about Linux and UTF-8.

> Some of the Unicode standards guys were dead-set against any encoding
> except plain 16-bit (but which byte order? :-)), but potential *users* of
> Unicode were much more pragmatic.  UTF-8 originally came out of the desire
> for a backward-compatible encoding for use in Unix filenames. 

Yes, true. And that is why I find it funny that the people
who were dead-set against anything other than 16 bit now get all
the glory for the stuff they fought so hard against. The irony of history :-)

> In any case, Unicode is much the more widely-known name, and much the more
> readily available standard, and (as others have noted) also comes with a
> lot of relevant supplementary information that 10646 lacks. 

The supplementary information is largely covered in 14651 and 14652.
And the specifications in Linux are then built on these standards,
not on Unicode tables.

> > The way 10646 is coming to Linux is also much
> > with the support from the ISO 14651 sorting standard and the ISO
> > TR 14652 locale standard. 
> 
> My understanding is that an ISO TR, by definition, is not a standard.

The ISO definitions of what a standard is, in general, encompass
ISO TRs as standards. It is not an ISO standard, but it is a standard
in the generic sense of the word.

> > I think the proper way to characterize what we do now in Linux is
> > to say ISO 10646, and probably mention Unicode in parenthesis the first
> > time it appears.
> 
> The pragmatic, and historically correct, way is the reverse.  ISO 10646
> delivers the ISO stamp (stomp? :-)) of approval for Unicode... but the
> standard you will find on the shelves of the people who do the work is
> labelled "Unicode".

I then think you are mislabelling the use of UTF-8 in Linux, as the
Unicode standards are not adhered to. Linux UTF-8 is not Unicode
conformant, but follows ISO 10646, 14651 and TR 14652.

Kind regards
keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
