Re: Why the hell doesn't foreach decode strings

Andrei Alexandrescu Sat, 29 Oct 2011 07:45:39 -0700

On 10/26/11 7:18 AM, Steven Schveighoffer wrote:

On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
<simen.kja...@gmail.com> wrote:

On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
<schvei...@yahoo.com> wrote:

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.


If this is correct (and it is), then decoding to dchar is simply not
enough.
You seem to advocate decoding to graphemes, which is a whole different
matter.


I am advocating that. And it's a matter of perception. D can say "we
only support code-point decoding" and what that means to a user is, "we
don't support language as you know it." Sure it's a part of unicode, but
it takes that extra piece to make it actually usable to people who
require unicode.

Even in English, fiancé has an accent. To say D supports unicode, but
then won't do a simple search on a file which contains a certain *valid*
encoding of that word is disingenuous to say the least.


Why doesn't that simple search work?

import std.algorithm : canFind;
import std.stdio;

void main() {
    foreach (line; stdin.byLine()) {
        if (line.canFind("fiancé")) {
            writeln("There it is.");
        }
    }
}
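
One way to make the "certain *valid* encoding" concrete is sketched below. It assumes the text stores the accent as a plain e followed by U+0301 (combining acute accent), while the search literal uses the precomposed U+00E9. The helper name containsNormalized is hypothetical, and the sketch assumes a Phobos release that ships std.uni.normalize. It shows one possible workaround, not a statement of how the search is meant to behave.

```d
import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni : NFC, normalize;

// Hypothetical helper: bring both sides to NFC before searching.
bool containsNormalized(string haystack, string needle) {
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main() {
    // The same word to a reader, but two different code point sequences.
    string precomposed = "fianc\u00E9";   // é as a single code point
    string decomposed  = "fiance\u0301";  // e followed by a combining accent

    // A code-point-level search does not treat the two forms as equal.
    assert(!decomposed.canFind(precomposed));

    // After normalizing both sides to the same form, the match succeeds.
    assert(containsNormalized(decomposed, precomposed));

    writeln("ok");
}
```

Normalizing every line costs an extra pass and may allocate, so whether this belongs in the default string handling is exactly the design question under discussion.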

D needs a fully unicode-aware string type. I advocate D should use it as
the default string type, but it needs one whether it's the default or
not in order to say it supports unicode.

How do you define "supports Unicode"? For my money, the main sin of
(w)string is that it offers [] and .length with potentially confusing
semantics, so if I could I'd curb, not expand, its interface.
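
A small sketch of what "potentially confusing semantics" can look like in practice, using the decomposed spelling of fiancé as an assumed input. It relies on std.range.walkLength and std.uni.byGrapheme, so it assumes a Phobos release that provides byGrapheme. The three counts below are for this one string only.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main() {
    string s = "fiance\u0301";  // e followed by a combining acute accent

    // .length counts UTF-8 code units: 6 ASCII bytes + 2 bytes for U+0301.
    assert(s.length == 8);

    // Iterating the string as a range decodes to dchar: 7 code points.
    assert(s.walkLength == 7);

    // Grouping by grapheme gives what a reader would count: 6 characters.
    assert(s.byGrapheme.walkLength == 6);

    // [] indexes code units: s[6] is the first byte of the encoded U+0301,
    // not a character in its own right.
    assert(s[6] == 0xCC);

    writeln("ok");
}
```

The same string therefore has three defensible "lengths", which is the ambiguity that [] and .length paper over.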



Andrei
