Dan Sugalski wrote:
> 
>...
> >
> >Is it really a good idea for the meaning of your Perl program to change
> >in this way between platforms?
> 
> Beats having ord('A') show as 65 on an EBCDIC platform.

This is a philosophical question of course.

I guess it depends on whether you think that encoding is/should be a
fundamental aspect of platforms anymore. There was a day when you could
guess what hardware you were running from the operating system but then
operating systems became portable. Similarly, encodings are now
portable. I can easily work with EBCDIC data on Windows or Linux and
mainframes can work with Unicode (otherwise how do they handle XML and
Java?). So what is an EBCDIC platform?

>>> u"abc".encode('cp500').decode('cp500').encode('ascii')
'abc'

> If the code in question wants a default character set, that can be specified.

With what scope? Global? Module? Function? Runtime or dynamic?

>...
> 
> I am also willing to put up with some hair if it means things go faster
> many places. I've my eye on China and Japan (amongst other places) as
> targets for parrot, and Unicode's not gonna cut it there.

I disagree but this is another philosophical issue.

Microsoft has standardized on Unicode. Yes there are many encodings
floating around and will be for years, but Microsoft's standard
*character set* is Unicode. Java has standardized on Unicode also.
JavaScript, XML and the Web have standardized on Unicode. I'm making a
strong distinction between the character *set* and the character
*encoding* because none of these technologies have standardized on a
single character *encoding*.

I believe that those technologies are likely to "take off" in Japan and
China. What kind of computing are they doing in Japan that is divorced
from Microsoft software, the Web, Java etc?

I've never heard of a Japanese competitor to XML, nor even Japanese
complaints about it. I know many Japanese people who use it.

  http://www.w3.org/TR/japanese-xml/

"This document is a submission to the World Wide Web Consortium from
Xerox, Panasonic, Toshiba, GLOCOM, Academia Sinica, Alis Technologies,
Sun Microsystems."

"This technical report and [XML] treat Shift-JIS, an ordinary Japanese
charset in Japan, as a CES that represents Japanese characters and
[US-ASCII] characters in [ISO/IEC10646] or [Unicode 3.0]. For full
interoperability in the Internet, migration from Shift-JIS to
UTF-8/UTF-16 is highly recommended. "

They suggest (surprisingly!) not only use of the Unicode character set
but even the standard Unicode encodings.
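The recommended migration is a pure re-encoding step, because Shift-JIS text maps into the Unicode character set. A sketch in Python (codec names `shift_jis` and `utf-8` from the standard codec registry; the sample bytes spell the word for "Japan"):

```python
# Shift-JIS bytes for the two ideographs of "Nihon" (Japan)
sjis_bytes = b'\x93\xfa\x96\x7b'

# Decode from the legacy CES into the Unicode character set...
text = sjis_bytes.decode('shift_jis')

# ...then re-encode in a standard Unicode encoding for interchange.
utf8_bytes = text.encode('utf-8')
```

No characters are lost or remapped in the round trip; only the byte-level representation changes, which is exactly why the W3C note can recommend the migration.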

Furthermore, the Unicode character set has tens of thousands of empty
spaces. If Chinese and Japanese computer scientists scream loudly enough
they can have their characters separated into different planes. In that
case using a non-Unicode character set would be almost perverse:
"Character set A is a superset of character set B and they have
identical textual encodings, but I prefer character set B because it
maps the Kanji character for tree to code point 1211 instead of 99773."
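(The numbers 1211 and 99773 above are hypothetical, but the point is concrete: Unicode already assigns CJK ideographs fixed code points today.)

```python
# The Unicode code point of the ideograph for "tree":
tree = u'\u6728'
print(ord(tree))  # 26408, i.e. U+6728
```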

So Unicode is bound to win either way. Either it is "good enough" or it
can be changed to be good enough.

In theory Parrot could have all of the benefits of Unicode and still
support other character sets, but as a software developer I would not
take on that burden when the world is moving so strongly in a
standardized direction. It is of course your burden and thus your choice.
:-) If you figure out all of the details, maybe Python would copy you.

 Paul Prescod
