On May 28, 2014, at 11:55 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:
> > Being too opinionated (regarding opinions that deviate from the norm) tends
> > to put people off the language unless there's a clear benefit to forcing
> > the alternative behavior.
>
> We have already chosen to be opinionated by enforcing UTF-8 in our strings.
> This is an extension of that break with tradition.

There's no clear tradition regarding strings. Some languages treat strings as
just blobs of binary data with no associated encoding (and obviously, operate
on bytes). Some languages use an associated encoding with every string, but
those are pretty rare. Some languages, such as JavaScript and Obj-C, use UCS-2
(well, Obj-C tries to be UTF-16 but all of its accessors that operate on
characters actually operate on UTF-16 code units, which is effectively
equivalent to UCS-2).

> > Today we only need to teach the simple concept that strings are utf-8
> > encoded
>
> History has shown that understanding Unicode is not a simple concept. Asking
> for the "length" of a Unicode string is not a well-formed question, and we
> must express this in our API. I also don't agree with accessor functions that
> work on code units without warning, and for this reason I strongly disagree
> with supporting the [] operator on strings.

Unicode is not a simple concept. UTF-8 on the other hand is a pretty simple
concept. And string accessors that operate at the code unit level are very
common (in fact, I can't think of a single language that doesn't operate on
code units by default[1][2]). Pretty much the only odd part about Rust's
behavior here is that the slicing methods (with the exception of
slice_chars()) will fail if the byte index isn't on a character boundary, but
that's a natural extension of the fact that Rust strings are guaranteed to be
valid utf-8. And it's unrelated to the naming (even if it were called
.byte_slice() it would still fail with the same input; and honestly,
.byte_slice() looks like it will return a &[u8]).
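
To make the byte-index behavior described above concrete, here is a minimal
sketch. It assumes present-day Rust spellings rather than the 2014 API under
discussion: chars().count() stands in for char_len(), and str::get is used as
a checked slice that returns None where plain slicing would fail. The helper
names lengths and checked_slice are hypothetical, introduced only for
illustration.

```rust
// Illustration only: written against modern Rust, not the 2014 API in this thread.

/// Hypothetical helper: returns (UTF-8 code units, i.e. bytes; code points).
fn lengths(s: &str) -> (usize, usize) {
    // .len() is O(1); counting code points has to walk the whole string.
    (s.len(), s.chars().count())
}

/// Hypothetical helper: slices by byte index, yielding None instead of
/// panicking when an index is out of range or not on a character boundary.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    // 'h', then U+00E9 (two bytes in UTF-8), then "llo".
    let s = "h\u{e9}llo";

    assert_eq!(lengths(s), (6, 5));

    // Byte index 2 falls inside the two-byte encoding of U+00E9.
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));
    assert!(s.is_char_boundary(3));

    // Slicing on boundaries always yields valid UTF-8.
    assert_eq!(&s[1..3], "\u{e9}");
    assert_eq!(checked_slice(s, 0, 3), Some("h\u{e9}"));

    // Off a boundary the checked form returns None; &s[0..2] would panic.
    assert_eq!(checked_slice(s, 0, 2), None);
}
```

Renaming the methods would not change any of these results; the boundary
check follows from the guarantee that a string slice is valid UTF-8.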
Of course, we haven't mentioned .byte_slice() before, but if you're going to
rename .len() to .byte_len() you're going to have to add .byte_ prefixes to
all of the other methods that take byte indexes.

In any case, the core idea here is that .len() returns "the length" of the
string. And "the length" is the number of code units. This matches the
behavior of other languages.

-Kevin

[1]: Even Haskell can be said to operate on code units, as its built-in string
is a linked list of UTF-32 characters, which means the code unit is the
character. Although I don't know offhand how Data.Text or Data.ByteString
work.

[2]: Python 2.7 operates on bytes, but I just did some poking around in
Python3 and it seems to use characters for length and indexing. I don't know
what the internal representation of a Python3 string is, though, so I don't
know if they're using O(n) operations, or if they're using UTF-16/UTF-32
internally as necessary.

> On Wed, May 28, 2014 at 2:42 PM, Kevin Ballard <ke...@sb.org> wrote:
> Breaking with established convention is a dangerous thing to do. Being too
> opinionated (regarding opinions that deviate from the norm) tends to put
> people off the language unless there's a clear benefit to forcing the
> alternative behavior.
>
> In this case, there's no compelling benefit to naming the thing .byte_len()
> over merely documenting that .len() is in code units. Everything else that
> doesn't explicitly say "char" on strings is in code units too, so it's
> sensible that .len() is too. But having strings that don't have an inherent
> "length" is confusing to anyone who hasn't already memorized this difference.
>
> Today we only need to teach the simple concept that strings are utf-8
> encoded, and the corresponding notion that all of the accessor methods on
> strings (including indexing using []) use code units unless they specify
> otherwise (e.g. unless they contain the word "char").
>
> -Kevin
>
> On May 28, 2014, at 10:54 AM, Benjamin Striegel <ben.strie...@gmail.com>
> wrote:
>
>> > People expect there to be a .len()
>>
>> This is the assumption that I object to. People expect there to be a .len()
>> because strings have been fundamentally broken since time immemorial. Make
>> people type .byte_len() and be explicit about their desire to index via code
>> units.
>>
>> On Wed, May 28, 2014 at 1:12 PM, Kevin Ballard <ke...@sb.org> wrote:
>> It's .len() because slicing and other related functions work on byte indexes.
>>
>> We've had this discussion before in the past. People expect there to be a
>> .len(), and the only sensible .len() is byte length (because char length is
>> not O(1) and not appropriate for use with most string-manipulation
>> functions).
>>
>> Since Rust strings are UTF-8 encoded text, it makes sense for .len() to be
>> the number of UTF-8 code units. Which happens to be the number of bytes.
>>
>> -Kevin
>>
>> On May 28, 2014, at 7:07 AM, Benjamin Striegel <ben.strie...@gmail.com>
>> wrote:
>>
>>> I think that the naming of `len` here is dangerously misleading. Naive
>>> ASCII-users will be free to assume that this is counting codepoints rather
>>> than bytes. I'd prefer the name `byte_len` in order to make the behavior
>>> here explicit.
>>>
>>> On Wed, May 28, 2014 at 5:55 AM, Simon Sapin <simon.sa...@exyr.org> wrote:
>>> On 28/05/2014 10:46, Aravinda VK wrote:
>>> Thanks. I didn't know about char_len.
>>> `unicode_str.as_slice().char_len()` is giving number of code points.
>>>
>>> Sorry for the confusion, I was referring codepoint as character in my
>>> mail. char_len gives the correct output for my requirement. I have
>>> written javascript script to convert from string length to grapheme
>>> cluster length for Kannada language.
>>>
>>> Be careful, JavaScript’s String.length counts UCS-2 code units, not code
>>> points…
>>>
>>> --
>>> Simon Sapin
>>> _______________________________________________
>>> Rust-dev mailing list
>>> Rust-dev@mozilla.org
>>> https://mail.mozilla.org/listinfo/rust-dev
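
The caution quoted above about code units versus code points can be checked
with a small sketch. It assumes present-day Rust; the helper name unit_counts
is hypothetical. It counts the same text three ways: UTF-8 code units (what
.len() returns), UTF-16 code units (what a JavaScript-style length would
count), and code points.

```rust
// Illustration only: written against modern Rust, not the 2014 API in this thread.

/// Hypothetical helper: returns
/// (UTF-8 code units, UTF-16 code units, code points) for `s`.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // U+0C95 KANNADA LETTER KA lies in the Basic Multilingual Plane:
    // three bytes in UTF-8, one 16-bit unit in UTF-16.
    assert_eq!(unit_counts("\u{0C95}"), (3, 1, 1));

    // U+1F600 lies outside the BMP: four bytes in UTF-8 and a surrogate
    // pair (two 16-bit units) in UTF-16, but still a single code point.
    assert_eq!(unit_counts("\u{1F600}"), (4, 2, 1));

    // A Kannada sequence of three code points: KA, VIRAMA, SSA.
    assert_eq!(unit_counts("\u{0C95}\u{0CCD}\u{0CB7}"), (9, 3, 3));
}
```

None of the three counts is a grapheme cluster count. That requires text
segmentation as described in Unicode Standard Annex #29, which the Rust
standard library does not provide.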