Re: [HACKERS] invalidly encoded strings

Tatsuo Ishii Mon, 10 Sep 2007 08:32:55 -0700

> Andrew Dunstan <[EMAIL PROTECTED]> writes:
> > The reason we are prepared to make an exception for Unicode is precisely 
> > because the code point maps to an encoding pattern independently of 
> > architecture, ISTM.
> 
> Right --- there is a well-defined standard for the numerical value of
> each character in Unicode.  And it's also clear what to do in
> single-byte encodings.  It's not at all clear what the representation
> ought to be for other multibyte encodings.  A direct transliteration
> of the byte sequence not only has endianness issues, but will have
> a weird non-dense set of valid values because of the restrictions on
> valid multibyte characters.
> 
> Given that chr() has never before behaved sanely for multibyte values at
> all, extending it to Unicode code points is a reasonable extension,
> and throwing error for other encodings is reasonable too.  If we ever do
> come across code-point standards for other encodings we can adopt 'em at
> that time.


I don't understand the whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent a
Unicode code point as UCS-4? Then you have to specify the endianness
anyway.  (See the UCS-4 standard for more details.)
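
To make the UCS-4 question concrete, here is a small Python sketch (an
illustration only, not anything from the PostgreSQL source). It assumes
the code point is passed around as a plain integer and shows that a byte
order only has to be chosen once that integer is serialized as 4-byte
UCS-4/UTF-32. The variable names are made up for this example.

```python
# U+0259 LATIN SMALL LETTER SCHWA, held as an abstract integer value.
code_point = 0x0259

# Serializing the integer as a 4-byte UCS-4 unit forces a byte order choice.
ucs4_big = code_point.to_bytes(4, "big")
ucs4_little = code_point.to_bytes(4, "little")

# Python's own UTF-32 codecs produce the same two byte sequences.
utf32_be = chr(code_point).encode("utf-32-be")
utf32_le = chr(code_point).encode("utf-32-le")

# UTF-8 is defined as a byte sequence, so it has no byte order variants.
utf8 = chr(code_point).encode("utf-8")

print(ucs4_big.hex(), ucs4_little.hex(), utf8.hex())
```

The two UCS-4 serializations differ only in byte order, while the
integer handed to chr() is the same in both cases.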

Or are you going to represent a Unicode code point as a character
string such as 'U+0259'? Then representing a code point of any encoding
as a string would avoid endianness issues just as well, and I don't see
that a Unicode code point is any better than the others.

Also I'd like to point out that every encoding has its own code point
system, as far as I know. For example, EUC-JP has its corresponding
code point systems: ASCII, JIS X 0208 and JIS X 0212. So I don't see
why we can't use a "code point" as chr()'s argument for other encodings
(of course we would need an optional parameter specifying which
character set is intended).
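
As a rough sketch of what such a parameter could look like for EUC-JP,
the Python below maps a (character set, code point) pair to EUC-JP
bytes. The function name euc_jp_bytes and the character set labels are
hypothetical, chosen for this example only; the byte layout (high bit
set on each byte, 0x8F prefix for JIS X 0212) follows the usual EUC-JP
definition.

```python
def euc_jp_bytes(code_point: int, charset: str) -> bytes:
    """Hypothetical helper: encode a code point of the named set as EUC-JP."""
    if charset == "ASCII":
        if not 0x00 <= code_point <= 0x7F:
            raise ValueError("not an ASCII code point")
        return bytes([code_point])

    # JIS X 0208 and JIS X 0212 are 94x94 sets: each byte is 0x21..0x7E.
    hi, lo = code_point >> 8, code_point & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a valid 94x94 code point")

    # EUC-JP sets the high bit of both bytes.
    body = bytes([hi | 0x80, lo | 0x80])
    if charset == "JIS X 0208":
        return body
    if charset == "JIS X 0212":
        # JIS X 0212 characters are prefixed with the single shift 0x8F.
        return b"\x8f" + body
    raise ValueError("unknown character set")


# JIS X 0208 code point 0x2422 is HIRAGANA LETTER A.
hiragana_a = euc_jp_bytes(0x2422, "JIS X 0208")
print(hiragana_a.hex(), hiragana_a.decode("euc_jp"))
```

Note that the argument is again just an integer, so no byte order is
involved until the EUC-JP bytes are produced.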
--
Tatsuo Ishii
SRA OSS, Inc. Japan

