Chris Hall wrote, 2008-03-12 13:20 (+0000):
> >> OK. In the meantime IMHO chr(n) should be handling utf8 and has no
> >> business worrying about things which UTF-8 or UCS think aren't
> >> characters.
> >It should do Unicode, not any specific byte encoding, like UTF-?8.
> IMHO chr(n) should do characters, which may be interpreted as per
> Unicode, but may not.
> When I said utf8 I was following the (sloppy) convention that utf8 means
> how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10. If Perl's
official documentation still says UTF-8 or UTF8 anywhere for
*characters in text strings*, it's wrong. Let me know and I will fix
it :)

> b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string. Likewise, in Perl you don't have to know whether your
number is internally encoded as a long integer or a double.

> Where UTF-8 (upper case, with hyphen) means the RFC 3629 &
> Unicode Consortium defined byte-wise encoding.

That's the theory, but in practice it so often doesn't entirely follow
the spec.

> This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

> There is really no need to discuss this, except in the context of
> messing around in guts of Perl.

Exactly.

> String literals are represented by UCS code points. Which
> reinforces the feeling that characters in Perl are Unicode.

Yes!

> 'C' uses 'wide' to refer to characters that may have values
> > 255. IMHO it's a shame that Perl did not follow this.

It does in some places, most notably in warnings about "wide
characters".

> d. when exchanging character data with other systems one needs to
> deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself,
for example if you forked.

> "Isolated surrogate code units have no interpretation on
> their own."
> (...)
> Clearly these are illegal in UTF-8.

They have no interpretation, but that doesn't say they're illegal.
Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's
not illegal to do it.

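A minimal sketch of the character/byte distinction discussed above,
assuming only the core Encode module (the variable names are purely
illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# A text string holds characters; how Perl stores it internally is
# not something the program needs to know.
my $text = chr(0x20AC);                 # one character: U+20AC EURO SIGN

# Bytes only appear at the I/O boundary, through an explicit encoding.
my $bytes = encode('UTF-8', $text);     # three bytes: E2 82 AC
my $back  = decode('UTF-8', $bytes);    # one character again

printf "%d character, %d bytes, round trip %s\n",
    length($text), length($bytes), ($back eq $text ? 'ok' : 'FAILED');

# Encode keeps two distinct encodings behind these similar names:
# 'UTF-8' (with hyphen, any case) is the strict one,
# 'utf8' / 'UTF8' (no hyphen, any case) is the lax one.
my $strict_name = find_encoding('UTF-8')->name;   # utf-8-strict
my $lax_name    = find_encoding('UTF8')->name;    # utf8
print "$strict_name vs $lax_name\n";
```

The point of the sketch is that length() counts characters on $text and
bytes on $bytes, and that the hyphen, not the case, selects between the
strict and the lax encoding.
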
> "Applications are free to use any of these noncharacter code
> points internally but should never attempt to exchange
> them.

I don't think it's Perl's job to prevent exchange, simply because the
exchange could be internal: between processes of the same program.

> I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.

> >The result is Unicode.
> IMHO the result of chr(n) should just be a character.

We call that a Unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but Perl's
documentation still calls it Unicode.

> OK, sure. I was using utf8 to mean any character value you like, and
> UTF-8 to imply a value which is recognised in UCS -- rather than the
> encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

> FWIW I note that printf "%vX" is suggested as a means to render IPv6
> addresses. This implies the use of a string containing eight characters
> 0..0xFFFF as the packed form of IPv6. Building one of those using
> chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer: Perl hacker <[EMAIL PROTECTED]> <http://juerd.nl/sig>
  Convolution: ICT solutions and consultancy <[EMAIL PROTECTED]>