Re: [rust-dev] How to find Unicode string length in rustlang

Kevin Ballard Wed, 28 May 2014 12:48:22 -0700

On May 28, 2014, at 11:55 AM, Benjamin Striegel <[email protected]> wrote:

> > Being too opinionated (regarding opinions that deviate from the norm) tends 
> > to put people off the language unless there's a clear benefit to forcing 
> > the alternative behavior.
> 
> We have already chosen to be opinionated by enforcing UTF-8 in our strings. 
> This is an extension of that break with tradition.

There's no clear tradition regarding strings. Some languages treat strings as 
just blobs of binary data with no associated encoding (and obviously, operate 
on bytes). Some languages use an associated encoding with every string, but 
those are pretty rare. Some languages, such as JavaScript and Obj-C, use UCS-2 
(well, Obj-C tries to be UTF-16 but all of its accessors that operate on 
characters actually operate on UTF-16 code units, which is effectively 
equivalent to UCS-2).
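
To make the code unit vs. code point distinction concrete, here is a minimal sketch in Rust, assuming a current stable toolchain rather than the 2014 API discussed in this thread. The helper names utf16_units and code_points are made up for illustration. It counts what a UTF-16-based accessor would see for a character outside the Basic Multilingual Plane.

```rust
/// Number of UTF-16 code units needed to encode `s`
/// (hypothetical helper; roughly what a UTF-16-based length reports).
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Number of Unicode scalar values (code points) in `s`.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
    // so UTF-16 needs a surrogate pair to encode it.
    let clef = "\u{1D11E}";
    assert_eq!(utf16_units(clef), 2); // two UTF-16 code units
    assert_eq!(code_points(clef), 1); // one code point
    assert_eq!(clef.len(), 4);        // four UTF-8 code units (bytes)
}
```

The same string has three different "lengths" depending on which unit is counted.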

> > Today we only need to teach the simple concept that strings are utf-8 
> > encoded
> 
> History has shown that understanding Unicode is not a simple concept. Asking 
> for the "length" of a Unicode string is not a well-formed question, and we 
> must express this in our API. I also don't agree with accessor functions that 
> work on code units without warning, and for this reason I strongly disagree 
> with supporting the [] operator on strings.

Unicode is not a simple concept. UTF-8 on the other hand is a pretty simple 
concept. And string accessors that operate at the code unit level are very 
common (in fact, I can't think of a single language that doesn't operate on 
code units by default[1][2]). Pretty much the only odd part about Rust's 
behavior here is that the slicing methods (with the exception of slice_chars()) 
will fail if the byte index isn't on a character boundary, but that's a natural 
extension of the fact that Rust strings are guaranteed to be valid UTF-8. And 
it's unrelated to the naming (even if it were called .byte_slice() it would 
still fail with the same input; and honestly, .byte_slice() looks like it will 
return a &[u8]).
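
A rough sketch of the boundary behavior described above, assuming current stable Rust (where slice_chars() no longer exists). It uses str::get and str::is_char_boundary from the standard library; try_slice is a hypothetical wrapper added only for illustration. The non-panicking get is used so the failure case can be shown as a value.

```rust
/// Hypothetical wrapper: slice by byte indexes, returning None instead of
/// panicking when an index does not fall on a character boundary.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    // "héllo": 'é' (U+00E9) takes two bytes in UTF-8, at byte offsets 1 and 2.
    let s = "h\u{E9}llo";
    assert_eq!(s.len(), 6);

    assert!(s.is_char_boundary(1));  // start of 'é'
    assert!(!s.is_char_boundary(2)); // middle of 'é'

    assert_eq!(try_slice(s, 0, 1), Some("h"));
    assert_eq!(try_slice(s, 1, 3), Some("\u{E9}"));
    assert_eq!(try_slice(s, 0, 2), None); // would split 'é'; &s[0..2] would panic

    // A byte-level view is a separate thing and has type &[u8].
    assert_eq!(&s.as_bytes()[1..3], &[0xC3u8, 0xA9][..]);
}
```

Whatever the method is called, the index is a byte offset and the result is still required to be valid UTF-8, which is why the off-boundary slice is rejected.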

Of course, we haven't mentioned .byte_slice() before, but if you're going to 
rename .len() to .byte_len() you're going to have to add .byte_ prefixes to all 
of the other methods that take byte indexes.

In any case, the core idea here is that .len() returns "the length" of the 
string. And "the length" is the number of code units. This matches the behavior 
of other languages.
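
As an illustration, here is a minimal sketch assuming current stable Rust, where the char_len() mentioned elsewhere in this thread is spelled .chars().count(). The names byte_len and code_point_len are hypothetical, chosen only to label the two quantities.

```rust
/// O(1): number of UTF-8 code units (bytes). This is what .len() returns.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// O(n): number of code points, found by decoding the whole string.
fn code_point_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // The word "Kannada" in Kannada script, written with escapes:
    // five code points, each in the range that takes three bytes in UTF-8.
    let kannada = "\u{0C95}\u{0CA8}\u{0CCD}\u{0CA8}\u{0CA1}";
    assert_eq!(byte_len(kannada), 15);
    assert_eq!(code_point_len(kannada), 5);

    // For pure ASCII the two counts coincide.
    assert_eq!(byte_len("rust"), code_point_len("rust"));
}
```

Neither number is a grapheme cluster count, which the standard library does not provide.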

-Kevin

[1]: Even Haskell can be said to operate on code units, as its built-in string 
is a linked list of UTF-32 characters, which means the code unit is the 
character. Although I don't know offhand how Data.Text or Data.ByteString work.

[2]: Python 2.7 operates on bytes, but I just did some poking around in Python3 
and it seems to use characters for length and indexing. I don't know what the 
internal representation of a Python3 string is, though, so I don't know if 
they're using O(n) operations, or if they're using UTF-16/UTF-32 internally as 
necessary.

> On Wed, May 28, 2014 at 2:42 PM, Kevin Ballard <[email protected]> wrote:
> Breaking with established convention is a dangerous thing to do. Being too 
> opinionated (regarding opinions that deviate from the norm) tends to put 
> people off the language unless there's a clear benefit to forcing the 
> alternative behavior.
> 
> In this case, there's no compelling benefit to naming the thing .byte_len() 
> over merely documenting that .len() is in code units. Everything else that 
> doesn't explicitly say "char" on strings is in code units too, so it's 
> sensible that .len() is too. But having strings that don't have an inherent 
> "length" is confusing to anyone who hasn't already memorized this difference.
> 
> Today we only need to teach the simple concept that strings are utf-8 
> encoded, and the corresponding notion that all of the accessor methods on 
> strings (including indexing using []) use code units unless they specify 
> otherwise (e.g. unless they contain the word "char").
> 
> -Kevin
> 
> On May 28, 2014, at 10:54 AM, Benjamin Striegel <[email protected]> 
> wrote:
> 
>> > People expect there to be a .len()
>> 
>> This is the assumption that I object to. People expect there to be a .len() 
>> because strings have been fundamentally broken since time immemorial. Make 
>> people type .byte_len() and be explicit about their desire to index via code 
>> units.
>> 
>> 
>> On Wed, May 28, 2014 at 1:12 PM, Kevin Ballard <[email protected]> wrote:
>> It's .len() because slicing and other related functions work on byte indexes.
>> 
>> We've had this discussion before. People expect there to be a 
>> .len(), and the only sensible .len() is byte length (because char length is 
>> not O(1) and not appropriate for use with most string-manipulation 
>> functions).
>> 
>> Since Rust strings are UTF-8 encoded text, it makes sense for .len() to be 
>> the number of UTF-8 code units. Which happens to be the number of bytes.
>> 
>> -Kevin
>> 
>> On May 28, 2014, at 7:07 AM, Benjamin Striegel <[email protected]> 
>> wrote:
>> 
>>> I think that the naming of `len` here is dangerously misleading. Naive 
>>> ASCII-users will be free to assume that this is counting codepoints rather 
>>> than bytes. I'd prefer the name `byte_len` in order to make the behavior 
>>> here explicit.
>>> 
>>> 
>>> On Wed, May 28, 2014 at 5:55 AM, Simon Sapin <[email protected]> wrote:
>>> On 28/05/2014 10:46, Aravinda VK wrote:
>>> Thanks. I didn't know about char_len.
>>> `unicode_str.as_slice().char_len()` is giving number of code points.
>>> 
>>> Sorry for the confusion, I was referring to a code point as a character
>>> in my mail. char_len gives the correct output for my requirement. I have
>>> written a JavaScript script to convert from string length to grapheme
>>> cluster length for the Kannada language.
>>> 
>>> Be careful, JavaScript’s String.length counts UCS-2 code units, not code 
>>> points…
>>> 
>>> 
>>> -- 
>>> Simon Sapin
>>> _______________________________________________
>>> Rust-dev mailing list
>>> [email protected]
>>> https://mail.mozilla.org/listinfo/rust-dev
