Moritz Lenz wrote:
NotFound wrote:
To open another can of worms, I think we can live without
character set specification. We can establish that the character set is
always Unicode, and deal only with encodings.

We had that discussion already, and the answer was "no" for several reasons:
* Strings might contain binary data; it doesn't make sense to view them
as Unicode (see the sketch after this list)
* Unicode isn't necessarily universal, and might stop being so in the
future. If a character is not representable in Unicode, and you chose to
use Unicode for everything, you're screwed
* Related to the previous point, some other character encodings might
not have a lossless round-trip conversion through Unicode.
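
A minimal sketch of the first point, in Python rather than anything Parrot-specific: arbitrary binary data frequently isn't even valid in a Unicode encoding, so pretending it is text would only get in the way.

    # Arbitrary binary data, e.g. a slice of some binary file format.
    # 0xFF never occurs in valid UTF-8, and 0xC3 must be followed by a
    # continuation byte in 0x80-0xBF, so this buffer is not decodable.
    data = bytes([0x00, 0xFF, 0xC3, 0x28, 0x7F])

    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)

    # The bytes are still perfectly usable as data; byte-level operations
    # work without pretending the buffer is Unicode text.
    print(data[1:3].hex())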

Yes, we can never assume Unicode as the character set, or restrict Parrot to handling only the Unicode character set.

ASCII is an encoding
that maps directly to codepoints and only allows values 0-127.
ISO-8859-1 is the same with the 0-255 range. Any other 8-bit encoding just
needs a translation table. The only point left to solve is that we need some
special way to work with fixed-8 data that has no intended character
representation.
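
To make the translation-table idea concrete, here is a small Python sketch (the table entries are invented for illustration; this is not a real registered encoding or Parrot's actual API) mapping a hypothetical 8-bit encoding onto Unicode codepoints:

    # Hypothetical 8-bit encoding: 0-127 map directly to ASCII codepoints,
    # a few high bytes map through a translation table (values invented).
    HIGH_TABLE = {
        0xA4: 0x20AC,  # EURO SIGN
        0xD6: 0x00D6,  # LATIN CAPITAL LETTER O WITH DIAERESIS
        0xFF: 0x00FF,  # LATIN SMALL LETTER Y WITH DIAERESIS
    }

    def decode_to_codepoints(raw):
        """Map each byte of the hypothetical encoding to a Unicode codepoint."""
        out = []
        for b in raw:
            if b < 0x80:              # ASCII range maps directly
                out.append(b)
            elif b in HIGH_TABLE:     # table-driven high bytes
                out.append(HIGH_TABLE[b])
            else:
                raise ValueError("byte %#04x has no mapping" % b)
        return out

    print(decode_to_codepoints(b"A\xa4"))   # [65, 8364]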

Introducing the "no character set" character set is just a special case
of arbitrary character sets. I see no point in using the special case
over the generic one.

The thing is, there's a tendency for all the data for a particular program or application to come from the same character set (if, for example, you're parsing a series of files, munging the data in some way, and writing out a series of files as a result). We never want to force all data to be transformed into one "canonical" character set, because it significantly increases the cost of working with data from different character sets, and the chances of corrupting that data in the process. If someone is reading, modifying, and writing EBCDIC files, they shouldn't have to translate their data to an intermediate format and back again.
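
As a small illustration of that last point (a Python sketch; cp037 is used only to build a self-contained EBCDIC sample, where real code would read the bytes from a file): the munging can happen entirely at the byte level, so the data never passes through an intermediate character set.

    # EBCDIC (cp037) bytes for "hello"; encoded here only so the example
    # is self-contained instead of reading an actual EBCDIC file.
    data = "hello".encode("cp037")

    # Byte-level munge: uppercase the EBCDIC lowercase ranges (a-i, j-r,
    # s-z sit at 0x81-0x89, 0x91-0x99, 0xA2-0xA9; uppercase is exactly
    # 0x40 higher), without ever leaving EBCDIC.
    table = bytes(
        b + 0x40 if (0x81 <= b <= 0x89 or 0x91 <= b <= 0x99 or 0xA2 <= b <= 0xA9)
        else b
        for b in range(256)
    )
    munged = data.translate(table)

    print(munged == "HELLO".encode("cp037"))  # True: EBCDIC in, EBCDIC out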

Allison
