On Tue, Feb 27, 2007 at 09:49:50AM -0500, SrinTuar wrote:
> On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> >> Old code doesn't need to be ported.
> >
> >Very strange advice, indeed.
>
> You might want to read up on the history of UTF-8.

Here are some references for anyone wanting to do so:

http://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

> Not needed to make any code changes at all to most applications was in
> fact one of the primary design goal of the encoding.

I'd like to expand on and strengthen this statement a bit: the goal was
not just to avoid making code changes, but to avoid requirements on text
that would be fundamentally incompatible with some of the most powerful
tools in the unix model.

UTF-16 (or at that time, UCS-2) not only broke the API of standard C and
unix; it also broke the statelessness and robustness of text and the
ability to treat it as binary byte streams in pipes, etc., due to byte
order issues and the BOM. This could have been avoided only by
redefining the atomic data unit (byte) to be 16 (or later 21 :) bits,
which would in turn have required scrapping and replacing every
octet-based internet protocol.

Hopefully a good understanding of the history and motivations behind
UTF-8 makes it clear that UTF-8 is not (as Windows and Java fans try to
portray it) a backwards-compatibility hack, but instead a fundamentally
better encoding scheme which allows powerful unix data processing
principles to continue to be used with text.

It's a shame the history isn't better known.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
