Re: [rust-dev] Unicode vs hex escapes in Rust

Graydon Hoare Thu, 05 Jul 2012 11:18:44 -0700

On 12-07-04 1:12 PM, Christian Siefkes wrote:

personally, I find the current behavior of Rust less risky and more logical.
If you can write '\u263a', why would you want to write the cumbersome
'\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
'\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
since they are older than Unicode/UTF-8, but what would be the point of
allowing it in Rust?

Oh, a good point, but we wouldn't accept it during parsing. I don't want to get into the game of allowing strings in that aren't valid utf8. Use a [u8] for that.
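To make the "broken UTF-8" concern concrete, here is a minimal sketch in present-day Rust syntax (which differs from the 2012 syntax discussed in this thread). The helper name `decode_utf8` is hypothetical; it only wraps the standard library's `std::str::from_utf8`, which rejects truncated sequences such as the ones mentioned above.

```rust
// Hypothetical helper: returns the decoded text if `bytes` is
// well-formed UTF-8, and None otherwise.
fn decode_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

With this, `decode_utf8(&[0xE2, 0x98, 0xBA])` yields the one-character string U+263A, while `decode_utf8(&[0xE2, 0x98])` and `decode_utf8(&[0xE2])` both yield `None`, because a three-byte sequence cut short is not valid UTF-8.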


The string-specific reasons I can see for this are:

  - You want to denote some utf8 bytes and you want to avoid doing
    the work of figuring out which codepoint it decodes to. For
    example if you were writing a crude tool that emitted rust string
    literals by doing byte-at-a-time copies of text files.

  - You want to copy a string literal from C or C++.

Neither of these are _great_ reasons, but they feel like enough to consider the change. I'm not actually sure how to interpret the "risk" Behdad suggested of users thinking strings are latin-1 (as in: why they would, and how to mitigate that). I mean, maybe if the user believed that \xNN was the only escape form, no longer escapes? I don't know, it's 2012 and I am sort of perplexed that anyone would think strings would be anything other than unicode-of-some-sort. Anyone looking at \xNN and wanting to write longer escapes would, I expect, google "rust unicode escapes", or try writing "\uNNNN" or something :)
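For a sense of the work the first case is trying to avoid, here is a rough sketch of a literal-emitting tool that decodes to code points instead of copying bytes. It assumes present-day Rust, where the escape is spelled `\u{...}` rather than the `\uNNNN` form in this thread, and `to_rust_literal` is a hypothetical name, not an existing tool.

```rust
// Hypothetical helper: renders `text` as a Rust string literal,
// escaping everything except plain printable ASCII by code point.
fn to_rust_literal(text: &str) -> String {
    let mut out = String::from("\"");
    for c in text.chars() {
        let plain = c == ' ' || (c.is_ascii_graphic() && c != '"' && c != '\\');
        if plain {
            out.push(c);
        } else {
            // `c as u32` is the code point, not a UTF-8 byte.
            out.push_str(&format!("\\u{{{:x}}}", c as u32));
        }
    }
    out.push('"');
    out
}
```

The input has to be decoded as UTF-8 first (here it already arrives as `&str`), which is exactly the step a byte-at-a-time copier would skip.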

No such danger exists in the current implementation, where every \xNN
sequence refers to a Unicode codepoint < 256 (which also happens to be
Latin1 character, but that's just because Unicode is a superset of Latin1).
The current implementation is simple and consistent: all escapes refer to
code points, none refers to bytes. If your code point is below 2^8, you can
use any of "\xHH, \u00HH, \U000000HH", if it's below 2^16, you can use
either of "\uHHHH, \U0000HHHH", otherwise you have to use "\UHHHHHHHH". Nice
and sane.

I agree. This is the counterargument and the one I had in mind when picking the current scheme. Any other feelings / rationales for deciding one way or another? I'm not super clear on which way to go on this.
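The difference between the two readings of \xNN only shows up for values of 0x80 and above. As an illustration, written in present-day Rust with a hypothetical helper `utf8_bytes_of`, the sketch below gives the bytes a string would actually hold when an escape is read as a code point.

```rust
// Hypothetical helper: UTF-8 encoding of the single code point `cp`,
// or None if `cp` is not a valid Unicode scalar value.
fn utf8_bytes_of(cp: u32) -> Option<Vec<u8>> {
    let c = char::from_u32(cp)?;
    let mut buf = [0u8; 4];
    Some(c.encode_utf8(&mut buf).as_bytes().to_vec())
}
```

Under the code-point reading, an escape for 0xE9 denotes U+00E9 and is stored as the two bytes 0xC3 0xA9. Under the byte reading it would be the single byte 0xE9, which is not valid UTF-8 on its own. For values below 0x80 the two readings coincide: `utf8_bytes_of(0x41)` is just the byte 0x41.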

Admittedly, if string literals should be useful not only for entering UTF-8
sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
proposal makes more sense. But for such purposes, wouldn't it be better to
specify them directly as u8 vectors, e.g. [0xE2,0x98,0xBA] ?

Definitely. This is really only an interop question, in my mind, not an expressivity one. That is: how likely is our behavior to be an unwelcome surprise when someone's trying to do something specific with a string-literal?


-Graydon
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
