On 07/04/2012 08:53 PM, Graydon Hoare wrote:
> On 12-07-04 6:55 AM, Behdad Esfahbod wrote:
>> * Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that
>> \xHH be modified to allow inputting direct UTF-8 bytes. For ASCII it doesn't
>> make any different. For Latin1, it gives the impression that strings are
>> stored in Latin1, which is not the case. It would also make C / Python
>> escaped strings directly usable in Rust. Ie. '\xE2\x98\xBA' would be a single
>> character equivalent to '\u263a', not three Latin1 characters.
>
> Heh. This is interesting! I hadn't noticed yet but you're not _entirely_
> giving the whole story.
>
> - \xNN means a utf8 byte: python2, python3 'bytes' literals,
>   perl, go, C, C++, C++-0x u8 literals, and ruby
>
> - \xNN means a unicode codepoint: python3 'string' literals,
>   javascript, scheme (at least racket follows spec; others
>   get it randomly wrong by implementation), and current rust.
>
> - \xNN illegal, but the octal version means a unicode codepoint:
>   java.
>
> So, my inclination is to follow your suggestion and actually go with the C
> and C++ style. But it's not exactly universal!

Personally, I find the current behavior of Rust less risky and more logical.
If you can write '\u263a', why would you want to write the cumbersome
'\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
'\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
since they are older than Unicode/UTF-8, but what would be the point of
allowing it in Rust?
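
To make the truncation risk concrete, here is a minimal sketch in
present-day Rust (not the 2012 syntax discussed in this thread). The helper
name is_valid_utf8 is hypothetical and only there for illustration; it
relies on the standard library's std::str::from_utf8 check.

```rust
/// Hypothetical helper: returns true only if `bytes` is a complete,
/// well-formed UTF-8 sequence, as judged by the standard library.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With this helper, the full sequence [0xE2, 0x98, 0xBA] is accepted, while
the truncated [0xE2, 0x98] and [0xE2] are both rejected.
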
No such danger exists in the current implementation, where every \xNN
sequence refers to a Unicode codepoint < 256 (which also happens to be a
Latin1 character, but that's just because Unicode is a superset of Latin1).
The current implementation is simple and consistent: all escapes refer to
code points, none refers to bytes. If your code point is below 2^8, you can
use any of "\xHH, \u00HH, \U000000HH"; if it's below 2^16, you can use
either of "\uHHHH, \U0000HHHH"; otherwise you have to use "\UHHHHHHHH". Nice
and sane.
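
The rule above can be sketched as a small function that picks the shortest
escape form for a given code point. This is only an illustration: the name
shortest_escape is made up, the code is present-day Rust, and it produces
the escape as text in the 2012 notation rather than parsing anything.

```rust
/// Hypothetical helper: returns the shortest escape, in the notation
/// discussed in this thread, that can express the code point `cp`.
fn shortest_escape(cp: u32) -> String {
    if cp < 0x100 {
        // Below 2^8: two hex digits are enough.
        format!("\\x{:02X}", cp)
    } else if cp < 0x1_0000 {
        // Below 2^16: four hex digits.
        format!("\\u{:04X}", cp)
    } else {
        // Everything else needs the eight-digit form.
        format!("\\U{:08X}", cp)
    }
}
```

For example, shortest_escape(0xE9) gives \xE9, shortest_escape(0x263A)
gives \u263A, and shortest_escape(0x1F600) gives \U0001F600.
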
Admittedly, if string literals should be useful not only for entering UTF-8
sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
proposal makes more sense. But for such purposes, wouldn't it be better to
specify them directly as u8 vectors, e.g. [0xE2,0x98,0xBA] ?
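
As a rough sketch of the difference between the two readings, the same
three values can be decoded either as UTF-8 bytes or as three separate code
points below 256. Both function names are hypothetical, and the code is
present-day Rust rather than the 2012 language.

```rust
/// Byte reading: treat the values as one UTF-8 sequence and decode it.
/// Returns None if the bytes are not well-formed UTF-8.
fn decode_as_utf8(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Code point reading: every value below 256 becomes its own character,
/// which is how \xHH escapes are described in this message.
fn decode_as_code_points(values: &[u8]) -> String {
    values.iter().map(|&v| v as char).collect()
}
```

Given [0xE2, 0x98, 0xBA], the first function yields a single character
(U+263A) and the second yields three characters (U+00E2, U+0098, U+00BA).
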
Best regards
Christian
--
|------- Dr. Christian Siefkes ------- [email protected] -------
| Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/
| Peer Production Everywhere: http://peerconomy.org/wiki/
|---------------------------------- OpenPGP Key ID: 0x346452D8 --
If people are good only because they fear punishment, and hope for reward,
then we are a sorry lot indeed.
-- Albert Einstein
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
