On 07/04/2012 08:53 PM, Graydon Hoare wrote:
> On 12-07-04 6:55 AM, Behdad Esfahbod wrote:
>>    * Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest
>> that \xHH be modified to allow inputting direct UTF-8 bytes.  For ASCII it
>> doesn't make any different.  For Latin1, it gives the impression that
>> strings are stored in Latin1, which is not the case.  It would also make
>> C / Python escaped strings directly usable in Rust.  Ie. '\xE2\x98\xBA'
>> would be a single character equivalent to '\u263a', not three Latin1
>> characters.
> 
> Heh. This is interesting! I hadn't noticed yet but you're not _entirely_
> giving the whole story.
> 
>   - \xNN means a utf8 byte: python2, python3 'bytes' literals,
>     perl, go, C, C++, C++-0x u8 literals, and ruby
> 
>   - \xNN means a unicode codepoint: python3 'string' literals,
>     javascript, scheme (at least racket follows spec; others
>     get it randomly wrong by implementation), and current rust.
> 
>   - \xNN illegal, but the octal version means a unicode codepoint:
>     java.
> 
> So, my inclination is to follow your suggestion and actually go with the C
> and C++ style. But it's not exactly universal!

Personally, I find the current behavior of Rust less risky and more logical.
If you can write '\u263a', why would you want to write the cumbersome
'\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
'\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
since they are older than Unicode/UTF-8, but what would be the point of
allowing it in Rust?
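
The truncation risk can be checked mechanically. A minimal sketch in
present-day Rust (whose syntax postdates this thread): `is_valid_utf8` is a
hypothetical helper name, while `std::str::from_utf8` is the standard
library's validating conversion.

```rust
/// Returns true if `bytes` is a complete, well-formed UTF-8 sequence.
/// (Hypothetical helper, for illustration only.)
fn is_valid_utf8(bytes: &[u8]) -> bool {
    // from_utf8 validates the whole slice and fails on a truncated
    // multi-byte sequence such as a lead byte with missing continuations.
    std::str::from_utf8(bytes).is_ok()
}
```

Called with `&[0xE2, 0x98, 0xBA]` this returns true, since those bytes encode
U+263A; with the truncated `&[0xE2, 0x98]` or `&[0xE2]` it returns false.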

No such danger exists in the current implementation, where every \xNN
sequence refers to a Unicode code point < 256 (which also happens to be a
Latin1 character, but that's just because Unicode is a superset of Latin1).
The current implementation is simple and consistent: all escapes refer to
code points, none refers to bytes. If your code point is below 2^8, you can
use any of "\xHH, \u00HH, \U000000HH"; if it's below 2^16, you can use
either of "\uHHHH, \U0000HHHH"; otherwise you have to use "\UHHHHHHHH". Nice
and sane.
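
The size rule above can be written down as a small function. This is only a
sketch: `shortest_escape` is a hypothetical name, and the strings it produces
follow the \xHH / \uHHHH / \UHHHHHHHH forms discussed in this thread, not the
escape syntax of present-day Rust.

```rust
/// Picks the shortest escape form, under the scheme discussed in this
/// thread, that can express the code point of `c`.
/// (Hypothetical helper, for illustration only.)
fn shortest_escape(c: char) -> String {
    let cp = c as u32;
    if cp < 0x100 {
        // Below 2^8: two hex digits are enough.
        format!("\\x{:02X}", cp)
    } else if cp < 0x1_0000 {
        // Below 2^16: four hex digits.
        format!("\\u{:04X}", cp)
    } else {
        // Everything else needs the eight-digit form.
        format!("\\U{:08X}", cp)
    }
}
```

Every branch names a code point; at no point does the function deal with
bytes of the encoded string.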

Admittedly, if string literals should be useful not only for entering UTF-8
sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
proposal makes more sense. But for such purposes, wouldn't it be better to
specify them directly as u8 vectors, e.g. [0xE2,0x98,0xBA] ?
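
To make the two readings of the same three numbers concrete, here is a hedged
sketch in present-day Rust. Both function names are hypothetical; `char::from`
on a `u8` and `String::from_utf8` are standard library calls.

```rust
/// Reads each value as a code point below 256, as \xHH does in the
/// current implementation. (Hypothetical helper.)
fn as_code_points(values: &[u8]) -> String {
    values.iter().map(|&b| char::from(b)).collect()
}

/// Reads the values as raw UTF-8 bytes, as in Behdad's proposal; this
/// fails if they are not well-formed UTF-8. (Hypothetical helper.)
fn as_utf8_bytes(values: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(values.to_vec())
}
```

For `[0xE2, 0x98, 0xBA]`, the first reading gives a string of three
characters (six bytes once encoded as UTF-8), while the second gives the
single character U+263A. Keeping the bytes in a u8 vector makes the decoding
step explicit and checked.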

Best regards
        Christian

-- 
|------- Dr. Christian Siefkes ------- [email protected] -------
| Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/
|    Peer Production Everywhere:       http://peerconomy.org/wiki/
|---------------------------------- OpenPGP Key ID: 0x346452D8 --
If people are good only because they fear punishment, and hope for reward,
then we are a sorry lot indeed.
        -- Albert Einstein


_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
