Re: c++ strings and UTF-8 (other charsets)

Rich Felker Tue, 27 Feb 2007 13:01:16 -0800

On Tue, Feb 27, 2007 at 09:49:50AM -0500, ＳｒｉｎＴｕａｒ wrote:
> On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> >> Old code doesn't need to be ported.
> >
> >Very strange advice, indeed.
> 
> You might want to read up on the history of UTF-8.


Here are some references for anyone wanting to do so:
http://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

> Not needing to make any code changes at all to most applications was in
> fact one of the primary design goals of the encoding.

I'd like to expand on and strengthen this statement a bit: the goal
was not just to avoid making code changes, but to avoid requirements
on text that would be fundamentally incompatible with some of the most
powerful tools in the unix model. UTF-16 (or at that time, UCS-2) not
only broke the API of standard C and unix; it also broke the
statelessness and robustness of text and the ability to treat it as
binary byte streams in pipes, etc. due to byte order issues and BOM.
This could have been avoided only by redefining the atomic data unit
(byte) to be 16 (or later 21 :) bits, which would in turn have
required scrapping and replacing every octet-based internet protocol.
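
To make the byte-stream point concrete, here is a rough sketch in C. The
function names (count_fields, utf8_is_char_start, utf8_resync) and the
sample data are made up for illustration and do not come from any
library. It assumes an ASCII-compatible execution character set and
well-formed UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Count '/'-separated fields in a NUL-terminated string using only the
 * byte-oriented strchr. No ASCII byte ever occurs inside a UTF-8
 * multibyte sequence, so this needs no knowledge of the encoding. */
size_t count_fields(const char *s)
{
    size_t n = 1;
    while ((s = strchr(s, '/')) != NULL) {
        n++;
        s++;    /* step past the separator just found */
    }
    return n;
}

/* A byte starts a character unless it is a continuation byte (10xxxxxx). */
int utf8_is_char_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Statelessness: from any byte offset i, back up to the first byte of
 * the character containing it, without scanning from the beginning. */
size_t utf8_resync(const char *s, size_t i)
{
    while (i > 0 && !utf8_is_char_start((unsigned char)s[i]))
        i--;
    return i;
}

/* Hypothetical sample data: "naïve/日本語/file" spelled out as UTF-8 bytes. */
const char utf8_path[] =
    "na\xc3\xaf" "ve/\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e/file";

/* U+012F followed by U+0041 ("A") as UTF-16LE bytes. The text contains
 * no slash and no terminator, yet the bytes contain both 0x2F and 0x00. */
const char utf16le_text[] = { 0x2f, 0x01, 0x41, 0x00 };
```

With this data, count_fields(utf8_path) sees exactly the two real
separators, and utf8_resync() recovers a character boundary from the
middle of a multibyte sequence. The UTF-16LE bytes, by contrast, give a
false match for '/' on the very first byte, and strlen() stops inside
the second code unit.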

Hopefully a good understanding of the history and motivations behind
UTF-8 makes it clear that UTF-8 is not (as Windows and Java fans try
to portray it) a backwards-compatibility hack, but instead a
fundamentally better encoding scheme which allows powerful unix data
processing principles to continue to be used with text. It's a shame
the history isn't better-known.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
