On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> // D
> auto a = "á";
> auto b = "á";
> auto c = "\u200B";
> auto x = a ~ c ~ a;
> auto y = b ~ c ~ b;
>
> writeln(a.length); // 2 wtf
> writeln(b.length); // 3 wtf
> writeln(x.length); // 7 wtf
> writeln(y.length); // 9 wtf
[...]
This is an unfair comparison. In the Swift version you used
.count, but here you used .length, which is the length of the
array, NOT the number of characters or whatever you expect it
to be. You should rather use .count and specify exactly what
you want to count, e.g., byCodePoint or byGrapheme.
I suspect the Swift version will give you unexpected results if
you did something like compare "á" to "a\u301", for example
(which, in case it isn't obvious, are visually identical to
each other, and as far as an end user is concerned, should only
count as 1 grapheme).
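
Concretely, the distinction looks roughly like this in D (just a sketch, assuming the two literals above are the precomposed U+00E1 and the decomposed "a" plus combining U+0301, which is what the 2- and 3-byte lengths suggest; byGrapheme comes from std.uni, walkLength from std.range):

import std.stdio : writeln;
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    string a = "\u00E1";  // precomposed á: one code point, two UTF-8 bytes
    string b = "a\u0301"; // 'a' + combining acute accent: two code points, three bytes

    writeln(a.length, " ", b.length);         // 2 3 -- UTF-8 code units (what .length measures)
    writeln(a.walkLength, " ", b.walkLength); // 1 2 -- code points
    writeln(a.byGrapheme.walkLength, " ", b.byGrapheme.walkLength); // 1 1 -- grapheme clusters
}

Only the grapheme counts agree, and that is the number an end user would call "one character".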
Not even normalization will help you if you have a string like
"a\u301\u302": in that case, the *only* correct way to count
the number of visual characters is byGrapheme, and I highly
doubt Swift's .count will give you the correct answer in that
case. (I expect that Swift's .count will count code points, as
is the usual default in many languages, which is unfortunately
wrong when you're thinking about visual characters, which are
called graphemes in Unicode parlance.)
No, Swift counts grapheme clusters by default, so it gives 1. I
suggest you read the Swift chapter linked above. I think it's the
wrong choice from a performance standpoint, but they chose to
emphasize intuitiveness for the common case.
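
For comparison, here is a quick (untested) D sketch of your "a\u301\u302" case, using byGrapheme and std.uni's normalize (which defaults to NFC, if I remember right):

import std.stdio : writeln;
import std.uni : byGrapheme, normalize;
import std.range : walkLength;

void main()
{
    string s = "a\u0301\u0302"; // 'a' + combining acute + combining circumflex

    writeln(s.walkLength);            // 3 -- code points
    writeln(s.normalize.walkLength);  // still more than 1: no precomposed form covers this combination
    writeln(s.byGrapheme.walkLength); // 1 -- one grapheme cluster, i.e. one visual character
}

Only the grapheme count comes out as 1, which matches your point that normalization alone is not enough.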
I agree with most of the rest of what you wrote: programmers have no
silver bullet that lets them avoid the complexity of Unicode and of
human languages.