I think returning the length of a string in bytes is just fine. The
confusion arose because I didn't know that char_len is available in Rust.
Python 2.7 (str), Perl (on undecoded input), and PHP (strlen) return the
length of a string in bytes; Python 3 returns the number of codepoints,
and JS returns the number of UTF-16 code units, as Kevin points out below.
As long as we can iterate through chars without worrying about byte
length vs. codepoint length, it should be fine.

    let unicode_str = String::from_str("ಅರವಿಂದ");
    let v: Vec<char> = unicode_str.as_slice().chars().collect();
    for c in v.iter() {
        println!("{}", c);
    }

I wonder whether chars() is available on String itself, so that we can
avoid calling as_slice().chars().
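
A minimal sketch of the three notions of length discussed in this thread,
written against current stable Rust rather than the 2014 API used above
(String::from_str and as_slice() are not part of today's String API, so
this is not the code from the original message). The helper names
byte_len, char_count, and utf16_len are hypothetical and exist only for
illustration. On current Rust, String dereferences to str, so chars() can
be called on a String directly.

```rust
/// Number of UTF-8 code units (bytes); this is what `len()` returns.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (the "codepoints" of this thread).
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// Number of UTF-16 code units, the unit JavaScript's `.length` counts.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    let unicode_str = String::from("ಅರವಿಂದ");

    println!("bytes: {}", byte_len(&unicode_str));
    println!("chars: {}", char_count(&unicode_str));
    println!("utf16: {}", utf16_len(&unicode_str));

    // Iterate over chars directly on the String, with no intermediate Vec.
    for c in unicode_str.chars() {
        println!("{}", c);
    }

    // U+10000 lies outside the BMP: one char, but two UTF-16 code units.
    let outside_bmp = "\u{10000}";
    println!(
        "U+10000: bytes={} chars={} utf16={}",
        byte_len(outside_bmp),
        char_count(outside_bmp),
        utf16_len(outside_bmp)
    );
}
```

None of these counts grapheme clusters, so a user-perceived character
made of several codepoints is still counted more than once.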
--
Regards
Aravinda | ಅರವಿಂದ
http://aravindavk.in
On Thu, May 29, 2014 at 11:17 AM, Kevin Ballard <[email protected]> wrote:
> On May 28, 2014, at 9:16 PM, Bardur Arantsson <[email protected]>
> wrote:
>
> > Rust:
> >
> > $ cat
> > fn main() {
> > let l = "hï".len(); // Note the accent
> > println!("{:u}", l);
> > }
> > $ rustc hello.rs
> > $ ./hello
> > 3
> >
> > No matter how defective the notion of "length" may be, personally I
> > think that people will expect the former, but will be very surprised by
> > the latter. There are certainly cases where the JavaScript version is
> > wrong, but I conjecture that it "works" for the vast majority of cases
> > that people and programs are likely to encounter.
>
> The JavaScript version is quite wrong. Isaac points out that NFC vs NFD
> can change the result, although that's really an issue with grapheme
> clusters vs codepoints. More interestingly, JavaScript's idea of string
> length is wrong for anything outside of the BMP:
>
> $ node
> > "𐀀".length
> 2
>
> This is because it was designed for UCS-2 instead of UTF-16, so .length
> actually returns the number of UCS-2 code units in the string.
>
> Incidentally, that means that JavaScript and Rust do have the same
> fundamental definition of length (which is to say, number of code units).
> They just have a different code unit. In JavaScript it's confusing because
> you can learn to use JavaScript quite well without ever realizing that it's
> UCS-2 code units (i.e. that it's not codepoints). In Rust, we're very clear
> that our strings are utf-8 sequences, so it should surprise nobody when the
> length turns out to be the number of utf-8 code units.
>
> FWIW, Go uses utf-8 code units as well, and nobody seems to be confused
> about that.
>
> -Kevin
> _______________________________________________
> Rust-dev mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/rust-dev
>
>