On Wednesday, 19 November 2014 at 14:33:05 UTC, Adam D. Ruppe wrote:
I answered a random C# stackoverflow question about why string.length returns the value it does, with some rationale defending code units instead of "characters" - basically, I typed up a defense of D's string-as-array behavior.
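For concreteness, here is a minimal D sketch of the distinction (the sample string is mine, not from the answer):

    import std.range : walkLength;
    import std.stdio : writeln;

    void main()
    {
        string s = "héllo";    // 'é' needs two UTF-8 code units
        writeln(s.length);     // 6 -- code units, which is what .length reports
        writeln(s.walkLength); // 5 -- code points, which is what Phobos' auto-decoding iterates
    }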

To my surprise, my answer got an enormous number of votes* so I decided to post it to reddit too.

http://www.reddit.com/r/programming/comments/2mqghp/why_does_stringlength_count_code_units_instead_of/

It's really encouraging to me that there's been such a positive response. The question comes up here every so often too, with people saying string.length should give the number of characters, and of course we have the automatic UTF decoding done in Phobos, which comes up from time to time.

It looks like D, the language, made the right decisions here.

This reddit comment applies to the phobos thing though:

"Most people like to pick on surrogate pairs here, and decry languages which don't handle them "properly", but I think it's important to point out that handling surrogate pairs as a single character doesn't in any way fix the underlying issue -- many multiple-codepoint sequences are a single logical glyph even if you use 32 bit wide chars."


I know this has been said a lot of times... but I think the auto decoding in Phobos was, and is, a mistake. The bigger question is what I posited on stackoverflow: "Moreover, what's the point? Why do these metrics matter?" Similarly with std.algorithm on strings: why would you ever want to call sort on a string? Well, I can think of a few reasons, like checking the frequency of letters, but I think we should see what happens if Phobos changes from auto decoding to a compile error wherever it would occur. Then we can fix each site by casting to .representation (or whatever) to work with code units, or by manually adding a .utfDecode to work with dchars, making the decision explicit.
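As a rough sketch of what that explicit style could look like (.utfDecode is hypothetical; I'm assuming std.string.representation and std.utf.byDchar as stand-ins for the two cases):

    import std.algorithm : sort;
    import std.stdio : writeln;
    import std.string : representation; // immutable(ubyte)[] view -- code units, no decoding
    import std.utf : byDchar;           // lazy, explicit decoding to dchar

    void main()
    {
        string s = "héllo";

        // Code-unit view, e.g. for a letter-frequency count:
        auto bytes = s.representation.dup; // mutable ubyte[] copy so it can be sorted
        sort(bytes);

        // Explicit opt-in to decoding when code points are what's wanted:
        foreach (dchar c; s.byDchar)
            writeln(c);
    }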

That'd offer a way forward and I suspect would break less code than we might think.


* Stack Overflow votes are a silly thing: a somewhat easy answer like this gets a bazillion, whereas difficult questions with difficult answers get me one, maybe two votes. Oh well.

One more upvote.
I agree when you say auto decoding is a good choice.

Additionally, it allows good compatibility with the Linux API, as opposed to the Windows API, since the Windows Unicode functions take wide-char string parameters (always two bytes per code unit).
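For illustration, a small sketch (using the usual Phobos helpers) of how the same D string can feed both kinds of API: the UTF-8 bytes go straight to POSIX char* calls, while the Windows *W functions need a UTF-16 conversion:

    import std.string : toStringz; // NUL-terminated UTF-8 -- what POSIX char* calls expect
    import std.utf : toUTF16z;     // NUL-terminated UTF-16 -- what the Windows *W calls expect

    void main()
    {
        string s = "héllo";                   // stored as UTF-8 code units
        const(char)*  posixStr = s.toStringz; // pass to open(), fopen(), ...
        const(wchar)* winStr   = s.toUTF16z;  // pass to CreateFileW(), MessageBoxW(), ...
    }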

And finally, for someone like me who writes software for his own use, UTF-8 doesn't change anything, since I'm French and every char fits in one byte...
