On 12-07-04 1:12 PM, Christian Siefkes wrote:

Personally, I find the current behavior of Rust less risky and more logical.
If you can write '\u263a', why would you want to write the cumbersome
'\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
'\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
since they are older than Unicode/UTF-8, but what would be the point of
allowing it in Rust?

Oh, a good point, but we would reject such a literal during parsing. I don't want to get into the game of allowing in strings that aren't valid utf8. Use a [u8] for that.
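
To make the truncation problem concrete, here is a minimal sketch in present-day Rust syntax (an assumption: the `\u{...}` form and the `std::str::from_utf8` call postdate this thread, and `is_valid_utf8` is a hypothetical helper name). It checks that the full three-byte sequence decodes to U+263A while the truncated prefixes are not valid UTF-8, though they remain perfectly usable as plain bytes.

```rust
use std::str;

/// Hypothetical helper: true if `bytes` is a complete, valid UTF-8 sequence.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // The UTF-8 encoding of U+263A, as in the example above.
    let smiley = [0xE2u8, 0x98, 0xBA];

    // The complete sequence decodes to the single code point U+263A.
    assert_eq!(str::from_utf8(&smiley), Ok("\u{263a}"));

    // Dropping trailing bytes leaves an incomplete sequence, which is
    // rejected as a string.
    assert!(!is_valid_utf8(&smiley[..2]));
    assert!(!is_valid_utf8(&smiley[..1]));

    // The same truncated data is still fine as a byte slice.
    let raw: &[u8] = &smiley[..2];
    assert_eq!(raw.len(), 2);
}
```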

The string-specific reasons I can see for this are:

  - You want to denote some utf8 bytes and you want to avoid doing
    the work of figuring out which codepoint it decodes to. For
    example if you were writing a crude tool that emitted rust string
    literals by doing byte-at-a-time copies of text files.

  - You want to copy a string literal from C or C++.
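
The first case above, the crude byte-at-a-time emitter, might look roughly like the sketch below. The function name `emit_byte_escaped_literal` is hypothetical, and the output only makes sense as a string literal under the proposed reading where \xNN denotes a byte.

```rust
/// Hypothetical emitter: writes every input byte as a \xNN escape inside
/// double quotes, without decoding the bytes into code points first.
fn emit_byte_escaped_literal(bytes: &[u8]) -> String {
    let mut out = String::from("\"");
    for b in bytes {
        // Two uppercase hex digits per byte, e.g. 0xE2 becomes \xE2.
        out.push_str(&format!("\\x{:02X}", b));
    }
    out.push('"');
    out
}

fn main() {
    // The three UTF-8 bytes of U+263A, copied one byte at a time.
    let literal = emit_byte_escaped_literal(&[0xE2, 0x98, 0xBA]);
    assert_eq!(literal, "\"\\xE2\\x98\\xBA\"");
    println!("{}", literal);
}
```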

Neither of these is a _great_ reason, but together they feel like enough to consider the change. I'm not actually sure how to interpret the "risk" Behdad suggested of users thinking strings are latin-1 (as in: why they would, and how to mitigate that). I mean, maybe if the user believed that \xNN was the only escape form, and that there were no longer escapes? I don't know, it's 2012 and I am sort of perplexed that anyone would think strings would be anything other than unicode-of-some-sort. Anyone looking at \xNN and wanting to write longer escapes would, I expect, google "rust unicode escapes", or try writing "\uNNNN" or something :)

No such danger exists in the current implementation, where every \xNN
sequence refers to a Unicode codepoint < 256 (which also happens to be a
Latin1 character, but that's just because Unicode is a superset of Latin1).
The current implementation is simple and consistent: all escapes refer to
code points, none refers to bytes. If your code point is below 2^8, you can
use any of "\xHH, \u00HH, \U000000HH"; if it's below 2^16, you can use
either of "\uHHHH, \U0000HHHH"; otherwise you have to use "\UHHHHHHHH". Nice
and sane.
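
The difference between the two readings of \xNN can be sketched as follows, in present-day Rust syntax (an assumption: these library calls postdate this thread, and `utf8_of_code_point` is a hypothetical helper name). Treating 0xE9 as a code point gives U+00E9, which is stored as two UTF-8 bytes, whereas the byte reading would put the single byte 0xE9 into the string.

```rust
/// Hypothetical helper: interprets `value` as a Unicode code point below
/// 256 (the code-point reading of \xNN) and returns its UTF-8 encoding.
fn utf8_of_code_point(value: u8) -> Vec<u8> {
    char::from(value).to_string().into_bytes()
}

fn main() {
    // Code-point reading: 0xE9 means U+00E9, the same as "\u{e9}".
    assert_eq!(char::from(0xE9u8), '\u{e9}');

    // In UTF-8 that code point takes two bytes, not one.
    assert_eq!(utf8_of_code_point(0xE9), vec![0xC3u8, 0xA9]);

    // Below 0x80 the two readings agree: one byte, same value.
    assert_eq!(utf8_of_code_point(0x41), vec![0x41u8]);

    // Byte reading: the lone byte 0xE9 is not valid UTF-8 on its own.
    assert!(std::str::from_utf8(&[0xE9u8]).is_err());
}
```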

I agree. This is the counterargument and the one I had in mind when picking the current scheme. Any other feelings / rationales for deciding one way or another? I'm not super clear on which way to go on this.

Admittedly, if string literals should be useful not only for entering UTF-8
sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
proposal makes more sense. But for such purposes, wouldn't it be better to
specify them directly as u8 vectors, e.g. [0xE2,0x98,0xBA]?

Definitely. This is really only an interop question, in my mind, not an expressivity one. That is: how likely is our behavior to be an unwelcome surprise when someone's trying to do something specific with a string-literal?

-Graydon
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev