On Sun, 12 Aug 2001, Gurusamy Sarathy wrote:

> On Sun, 12 Aug 2001 12:46:13 CDT, Jarkko Hietaniemi wrote:
> >(not talking about pod in particular in the following)
> >
> >So you think we should declare and document that in future versions
> >(5.10?) we should adopt Unicode and UTF-8 across the board, meaning
> >that things like chr(), ord(), pack C, unpack C, \xHH should start
> >being strictly Unicode and UTF-8?  For example, chr(0x41) being the
> >Unicode uppercase A, and ord("A") returning 0x41, ***EVERYWHERE***?
> 
> Yes.
> 
> >(Naturally, to ease the transition, some sort of conversion tools for
> >places like EBCDIC would be needed.)
> 
> Such as an I/O filter on PL_rsfp that could be enabled by default if
> so dwimmed, perhaps.

I think that the "Unicode everywhere" prescription is a nice idea but
it is a bit too simplistic and does not solve some real world legacy
encoding problems.  I'd like to take an example from some extremely
small coded character sets to illustrate the problem.  Imagine that I
have a perl port, or a pod2foo implementation, that currently works
across these three severely limited coded character sets:

   @X = ('A','B','C','a','b','c','!','#');
   @Y = ('A','B','C','a','b','c','=','+');
   @Z = ('a','b','c','A','B','C','#','!');

For folks who have trouble following this simple example, these are
coded character sets where the perl array index is the numeric
codepoint: the zeroth element of @X is 'A', the character that sits at
codepoint 65 in the ASCII, ISO 8859-1, and Unicode coded character
sets.

There are several things worth noting about these very simple coded
character sets:

 * Each of them can be mapped into and out of the Unicode
   coded character set.  Since the Unicode Standard states as its goal
   the encoding of all computer character sets, this may not come as a
   surprise.
 * None of the above is equivalent to ASCII.  ASCII has 'A' at codepoint
   65, whereas @X and @Y have 'A' at codepoint 0 and @Z has it at
   codepoint 3.
 * I can map set @X into the @Z encoding and vice versa.  I could even
   map @X to Unicode and then map the resulting Unicode codepoints into
   the @Z encoding, using Unicode as an intermediate pivot, if I'd like
   to do it that way (see the sketch after this list).
 * I could huff and puff and write Encode *.ucm files all day long, but
   there is no way that I can map all of the @X encoding into the @Y
   encoding.  Even though they contain the same number of characters,
   they nevertheless contain incompatible character sets.  The same can
   be said of trying to map @Z to @Y, although the partial mapping of
   @X to @Y is a bit more compatible, since @X and @Y share a common
   coded character subset (they differ only in the last two codepoints
   and characters).
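
Here is a sketch of that pivot idea (make_map() is my own illustration,
not an existing perl or Encode API): map each source codepoint to its
Unicode codepoint via ord(), then look that Unicode codepoint up in the
target CCS.

   my @X = ('A','B','C','a','b','c','!','#');
   my @Y = ('A','B','C','a','b','c','=','+');
   my @Z = ('a','b','c','A','B','C','#','!');

   sub make_map {
       my ($from, $to) = @_;
       # index the target CCS by Unicode codepoint
       my %to_cp = map { (ord($to->[$_]) => $_) } 0 .. $#$to;
       # for each source codepoint, the target codepoint, or undef
       return [ map { $to_cp{ ord($from->[$_]) } } 0 .. $#$from ];
   }

   my $x_to_z = make_map(\@X, \@Z);  # complete: (3,4,5,0,1,2,7,6)
   my $x_to_y = make_map(\@X, \@Y);  # partial: entries 6 and 7 are undef

The undef entries are exactly the two incompatible characters, and no
choice of pivot CCS can make them appear.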

Note that if I want to solve the "compatibility problem" of getting
documents prepared in either CCS @X or @Z to work "as expected" in CCS
@Y, then I have to allow for the fact that the sets are incompatible
and not all characters will be rendered correctly (most will, but two
characters will be rendered incorrectly).  Resorting to a separate
coded character set, be it ASCII, Unicode, Baudot, Morse, or whatever
you wish, will not solve the problem.
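
To illustrate (again a sketch of my own, with a made-up @X "document"):
the best a transcoder can do with the two unmappable characters is to
substitute a replacement character, here '?'.

   my @X = ('A','B','C','a','b','c','!','#');
   my @Y = ('A','B','C','a','b','c','=','+');
   my %y_cp = map { (ord($Y[$_]) => $_) } 0 .. $#Y;  # Unicode cp -> @Y cp

   my @doc = (0, 3, 6, 7);              # 'A','a','!','#' as @X codepoints
   my @out = map {
       my $ycp = $y_cp{ ord($X[$_]) };
       defined $ycp ? $Y[$ycp] : '?';   # '!' and '#' have no home in @Y
   } @doc;
   print join('', @out), "\n";          # prints "Aa??"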

How long will it be before all computer systems internally handle
Unicode flawlessly?  (Who thought EBCDIC would still be alive these
days?  Not me.)  It will take a long time, particularly in light of the
looseness of the encoding forms allowed: UTF-8, UTF-16BE, UTF-16LE,
etc.  The wonders of Unicode do not address the incompatibility of
legacy sets.  Unfortunately, users do tend to think that it is the
programming language's fault for not understanding their font
preferences.

For folks who didn't quite follow this: in the above, replace @X with
@ISO_8859_1, replace @Y with either @MacRoman or @Windows_cp_1252, and
replace @Z with @IBM_1047, and the analogous situation still pertains
to these larger character sets.

Peter Prymmer


