> There's no clear tradition regarding strings.

Excellent, then surely nobody has any right to expect a method named .len() :)
> Unicode is not a simple concept. UTF-8 on the other hand is a pretty
> simple concept.

I don't think we can fully divorce these two ideas. Understanding UTF-8 still implies understanding the difference between code points, code units, and grapheme clusters. If we have a single unadorned `len` function, that implies the existence of a "default" length for a UTF-8 string, which is a lie. It also *fails* to suggest the existence of alternative measures of the length of a UTF-8 string. Finally, the choice of byte length as the default length metric encourages the horrid status quo: the perpetuation of code that is tested and works in ASCII environments but barfs as soon as anyone from a sufficiently foreign culture tries to use it. Dedicating ourselves to Unicode support does us no good if the remainder of our API encourages the depressingly typical ASCII-ism that pervades nearly every other language.

On Wed, May 28, 2014 at 3:48 PM, Kevin Ballard <ke...@sb.org> wrote:

> On May 28, 2014, at 11:55 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:
>
>>> Being too opinionated (regarding opinions that deviate from the norm) tends to put people off the language unless there's a clear benefit to forcing the alternative behavior.
>>
>> We have already chosen to be opinionated by enforcing UTF-8 in our strings. This is an extension of that break with tradition.
>
> There's no clear tradition regarding strings. Some languages treat strings as just blobs of binary data with no associated encoding (and obviously, operate on bytes). Some languages use an associated encoding with every string, but those are pretty rare. Some languages, such as JavaScript and Obj-C, use UCS-2 (well, Obj-C tries to be UTF-16, but all of its accessors that operate on characters actually operate on UTF-16 code units, which is effectively equivalent to UCS-2).
>
>>> Today we only need to teach the simple concept that strings are utf-8 encoded
>>
>> History has shown that understanding Unicode is not a simple concept. Asking for the "length" of a Unicode string is not a well-formed question, and we must express this in our API. I also don't agree with accessor functions that work on code units without warning, and for this reason I strongly disagree with supporting the [] operator on strings.
>
> Unicode is not a simple concept. UTF-8, on the other hand, is a pretty simple concept. And string accessors that operate at the code unit level are *very* common (in fact, I can't think of a single language that doesn't operate on code units by default[1][2]). Pretty much the only odd part about Rust's behavior here is that the slicing methods (with the exception of slice_chars()) will fail if the byte index isn't on a character boundary, but that's a natural extension of the fact that Rust strings are guaranteed to be valid UTF-8. And it's unrelated to the naming (even if it were called .byte_slice() it would still fail on the same input; and honestly, .byte_slice() looks like it would return a &[u8]).
>
> Of course, we haven't mentioned .byte_slice() before, but if you're going to rename .len() to .byte_len() you're going to have to add .byte_ prefixes to all of the other methods that take byte indexes.
>
> In any case, the core idea here is that .len() returns "the length" of the string. And "the length" is the number of code units. This matches the behavior of other languages.
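For concreteness, the behavior described above looks like this in Rust (a minimal sketch in post-1.0 syntax; the thread predates Rust 1.0, so method names in the quotes differ slightly):

    fn main() {
        // "é" is one code point (U+00E9) but two bytes in UTF-8.
        let s = "café";

        // .len() is the UTF-8 byte length: an O(1) operation.
        assert_eq!(s.len(), 5);

        // The code point count is a different, O(n) measure.
        assert_eq!(s.chars().count(), 4);

        // Byte-indexed slicing succeeds on a character boundary...
        assert_eq!(&s[0..3], "caf");

        // ...but panics off one: byte index 4 falls inside the
        // two-byte encoding of 'é'.
        // let _ = &s[0..4]; // panic: not a char boundary
    }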
> -Kevin
>
> [1]: Even Haskell can be said to operate on code units, as its built-in string is a linked list of UTF-32 characters, which means the code unit is the character. Although I don't know offhand how Data.Text or Data.ByteString work.
>
> [2]: Python 2.7 operates on bytes, but I just did some poking around in Python 3 and it seems to use characters for length and indexing. I don't know what the internal representation of a Python 3 string is, though, so I don't know whether it uses O(n) operations, or UTF-16/UTF-32 internally as necessary.
>
> On Wed, May 28, 2014 at 2:42 PM, Kevin Ballard <ke...@sb.org> wrote:
>
>> Breaking with established convention is a dangerous thing to do. Being too opinionated (regarding opinions that deviate from the norm) tends to put people off the language unless there's a clear benefit to forcing the alternative behavior.
>>
>> In this case, there's no compelling benefit to naming the thing .byte_len() over merely documenting that .len() is in code units. Everything else on strings that doesn't explicitly say "char" is in code units too, so it's sensible that .len() is as well. But having strings that don't have an inherent "length" is confusing to anyone who hasn't already memorized this difference.
>>
>> Today we only need to teach the simple concept that strings are utf-8 encoded, and the corresponding notion that all of the accessor methods on strings (including indexing using []) use code units unless they specify otherwise (e.g. unless they contain the word "char").
>>
>> -Kevin
>>
>> On May 28, 2014, at 10:54 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:
>>
>>> People expect there to be a .len()
>>
>> This is the assumption that I object to. People expect there to be a .len() because strings have been fundamentally broken since time immemorial. Make people type .byte_len() and be explicit about their desire to index via code units.
>>
>> On Wed, May 28, 2014 at 1:12 PM, Kevin Ballard <ke...@sb.org> wrote:
>>
>>> It's .len() because slicing and other related functions work on byte indexes.
>>>
>>> We've had this discussion before in the past. People expect there to be a .len(), and the only sensible .len() is byte length (because char length is not O(1) and not appropriate for use with most string-manipulation functions).
>>>
>>> Since Rust strings are UTF-8 encoded text, it makes sense for .len() to be the number of UTF-8 code units. Which happens to be the number of bytes.
>>>
>>> -Kevin
>>>
>>> On May 28, 2014, at 7:07 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:
>>>
>>> I think that the naming of `len` here is dangerously misleading. Naive ASCII users will be free to assume that it counts codepoints rather than bytes. I'd prefer the name `byte_len` in order to make the behavior here explicit.
>>>
>>> On Wed, May 28, 2014 at 5:55 AM, Simon Sapin <simon.sa...@exyr.org> wrote:
>>>
>>>> On 28/05/2014 10:46, Aravinda VK wrote:
>>>>
>>>>> Thanks. I didn't know about char_len. `unicode_str.as_slice().char_len()` is giving the number of code points.
>>>>>
>>>>> Sorry for the confusion; I was referring to a codepoint as a character in my mail. char_len gives the correct output for my requirement. I have written a JavaScript script to convert from string length to grapheme cluster length for the Kannada language.
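The three measures Aravinda is juggling can be spelled out in Rust. A sketch in modern syntax, assuming the thread's char_len() maps to today's .chars().count() and that grapheme clusters come from the external unicode-segmentation crate (not part of the standard library):

    // Requires the unicode-segmentation crate for grapheme clusters.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        // "ನಮಸ್ಕಾರ" is a Kannada greeting: 7 code points, each 3 bytes in UTF-8.
        let s = "ನಮಸ್ಕಾರ";

        // Byte length: what Rust's .len() returns, O(1).
        println!("bytes:     {}", s.len()); // 21

        // Code points: what the thread's char_len() counted, O(n).
        println!("chars:     {}", s.chars().count()); // 7

        // Grapheme clusters: closest to what a reader perceives as
        // characters; fewer than 7 here because of combining marks.
        println!("graphemes: {}", s.graphemes(true).count());
    }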
>>>> Be careful, JavaScript’s String.length counts UCS-2 code units, not
>>>> code points…
>>>>
>>>> --
>>>> Simon Sapin
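Simon's caveat is reproducible from Rust itself. A sketch using the post-1.0 encode_utf16() iterator to count exactly what JavaScript's String.length counts:

    fn main() {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic
        // Multilingual Plane, so UTF-16 encodes it as a surrogate pair.
        let clef = "𝄞";

        assert_eq!(clef.chars().count(), 1);        // one code point
        assert_eq!(clef.encode_utf16().count(), 2); // two UTF-16 code units
        assert_eq!(clef.len(), 4);                  // four UTF-8 bytes

        // JavaScript reports "𝄞".length === 2 for the same reason:
        // String.length counts UTF-16 (historically UCS-2) code units.
    }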
_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev