Moritz Lenz wrote:
NotFound wrote:
To open another can of worms, I think we can live without
character set specification. We can establish that the character set is
always Unicode, and deal only with encodings.

We had that discussion already, and the answer was "no" for several reasons:
* Strings might contain binary data; it doesn't make sense to view them
as Unicode (see the sketch after this list)
* Unicode isn't necessarily universal, and might stop being so in the
future. If a character is not representable in Unicode, and you chose to
use Unicode for everything, you're screwed
* Related to the previous point, some other character encodings might
not have a lossless round-trip conversion through Unicode.
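
A minimal sketch of the first point, in Python rather than anything Parrot-specific: arbitrary binary data frequently isn't even valid in a Unicode encoding, so pretending it is text would only get in the way.

    # Arbitrary binary data, e.g. a slice of some binary file format.
    # 0xFF never occurs in valid UTF-8, and 0xC3 must be followed by a
    # continuation byte in 0x80-0xBF, so this buffer is not decodable.
    data = bytes([0x00, 0xFF, 0xC3, 0x28, 0x7F])

    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)

    # The bytes are still perfectly usable as data; byte-level operations
    # work without pretending the buffer is Unicode text.
    print(data[1:3].hex())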

Yes, we can never assume Unicode as the character set, or restrict Parrot to handling only the Unicode character set.

ASCII is an encoding
that maps directly to codepoints and only allows values 0-127.
ISO-8859-1 is the same with the 0-255 range. Any other 8-bit encoding just
needs a translation table. The only point left to solve is that we need some
special way to work with fixed-8 data that has no intended character
representation.
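
To make the translation-table idea concrete, here is a small Python sketch (the table entries are invented for illustration; this is not a real registered encoding or Parrot's actual API) mapping a hypothetical 8-bit encoding onto Unicode codepoints:

    # Hypothetical 8-bit encoding: 0-127 map directly to ASCII codepoints,
    # a few high bytes map through a translation table (values invented).
    HIGH_TABLE = {
        0xA4: 0x20AC,  # EURO SIGN
        0xD6: 0x00D6,  # LATIN CAPITAL LETTER O WITH DIAERESIS
        0xFF: 0x00FF,  # LATIN SMALL LETTER Y WITH DIAERESIS
    }

    def decode_to_codepoints(raw):
        """Map each byte of the hypothetical encoding to a Unicode codepoint."""
        out = []
        for b in raw:
            if b < 0x80:              # ASCII range maps directly
                out.append(b)
            elif b in HIGH_TABLE:     # table-driven high bytes
                out.append(HIGH_TABLE[b])
            else:
                raise ValueError("byte %#04x has no mapping" % b)
        return out

    print(decode_to_codepoints(b"A\xa4"))   # [65, 8364]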

Introducing the "no character set" character set is just a special case
of arbitrary character sets. I see no point in using the special case
over the generic one.

The thing is, there's a tendency for all the data for a particular program or application to come from the same character set (if, for example, you're parsing a series of files, munging the data in some way, and writing out a series of files as a result). We never want to force all data to be transformed into one "canonical" character set, because it significantly increases the cost of working with data from different character sets, and the chances of corrupting that data in the process. If someone is reading, modifying, and writing EBCDIC files, they shouldn't have to translate their data to an intermediate format and back again.
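
As a small illustration of that last point (a Python sketch; cp037 is used only to build a self-contained EBCDIC sample, where real code would read the bytes from a file): the munging can happen entirely at the byte level, so the data never passes through an intermediate character set.

    # EBCDIC (cp037) bytes for "hello"; encoded here only so the example
    # is self-contained instead of reading an actual EBCDIC file.
    data = "hello".encode("cp037")

    # Byte-level munge: uppercase the EBCDIC lowercase ranges (a-i, j-r,
    # s-z sit at 0x81-0x89, 0x91-0x99, 0xA2-0xA9; uppercase is exactly
    # 0x40 higher), without ever leaving EBCDIC.
    table = bytes(
        b + 0x40 if (0x81 <= b <= 0x89 or 0x91 <= b <= 0x99 or 0xA2 <= b <= 0xA9)
        else b
        for b in range(256)
    )
    munged = data.translate(table)

    print(munged == "HELLO".encode("cp037"))  # True: EBCDIC in, EBCDIC out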

Allison
