On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> // D
> auto a = "á";
> auto b = "á";
> auto c = "\u200B";
> auto x = a ~ c ~ a;
> auto y = b ~ c ~ b;
>
> writeln(a.length); // 2 wtf
> writeln(b.length); // 3 wtf
> writeln(x.length); // 7 wtf
> writeln(y.length); // 9 wtf
[...]

This is an unfair comparison. In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be. You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
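
Concretely, that looks something like the following in D (a minimal sketch against Phobos' std.uni and std.range; the precomposed and decomposed forms are spelled out with explicit escapes, since the two "á" literals above are visually indistinguishable, and walkLength is used to count the elements of each view):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto a = "\u00E1";   // precomposed á: one code point, two UTF-8 code units
    auto b = "a\u0301";  // 'a' followed by U+0301 COMBINING ACUTE ACCENT

    writeln(a.length);                 // 2 -- UTF-8 code units (array length)
    writeln(b.length);                 // 3 -- UTF-8 code units (array length)
    writeln(a.walkLength);             // 1 -- code points (auto-decoded)
    writeln(b.walkLength);             // 2 -- code points
    writeln(a.byGrapheme.walkLength);  // 1 -- grapheme clusters
    writeln(b.byGrapheme.walkLength);  // 1 -- grapheme clusters
}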

I suspect the Swift version will give you unexpected results if you do something like compare "á" to "a\u0301" (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, each should count as just 1 grapheme).
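
(A quick D sketch of that comparison, with the decomposed form written as "a\u0301": the raw strings compare unequal, but NFC normalization from std.uni makes them compare equal.)

import std.stdio : writeln;
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "\u00E1";   // á as a single code point
    string decomposed  = "a\u0301";  // 'a' + combining acute accent

    writeln(precomposed == decomposed);                 // false: the code units differ
    writeln(normalize!NFC(decomposed) == precomposed);  // true: identical after NFC normalization
}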

Not even normalization will help you if you have a string like "a\u0301\u0302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer there. (I expect that Swift's .count counts code points, as is the usual default in many languages, which is unfortunately wrong when you are thinking about visual characters, called graphemes in Unicode parlance.)
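
(Again as a D sketch: after NFC the string below is still two code points, because no precomposed code point exists for 'a' carrying both accents, whereas byGrapheme sees the single visual character.)

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, byGrapheme, normalize;

void main()
{
    // 'a' + combining acute + combining circumflex: one visual character,
    // but there is no single precomposed code point for this combination.
    string s = "a\u0301\u0302";

    writeln(normalize!NFC(s).walkLength);  // 2 -- NFC only folds 'a' + U+0301 into U+00E1
    writeln(s.byGrapheme.walkLength);      // 1 -- one grapheme cluster, the visual count
}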

No, Swift counts grapheme clusters by default, so it gives 1. I suggest you read the Swift chapter linked above. I think it's the wrong choice performance-wise, but they chose to emphasize intuitive behaviour for the common case.

I agree with most of the rest of what you wrote: programmers have no silver bullet against the complexity of Unicode and of human languages.
