On Thursday, 6 September 2018 at 10:22:22 UTC, ag0aep6g wrote:
On 09/06/2018 09:23 AM, Chris wrote:
Python 3 gives me this:
print(len("á"))
1
Python 3 also gives you this:
print(len("á"))
2
(The example might not survive transfer from me to you if
Unicode normalization happens along the way.)
That's when you enter the 'á' as 'a' followed by U+0301
(combining acute accent). So Python's `len` counts in code
points, like D's std.range does (auto-decoding).
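
Since the literal 'á' may get normalized in transit, a small Python 3 sketch with explicit escape sequences makes both cases reproducible. The variable names are only illustrative.

```python
# Both strings render as "á", but they are built from different code points.
composed = "\u00e1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed))             # 1
print(len(decomposed))           # 2
print(composed == decomposed)    # False: different code point sequences
```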
To avoid this you have to normalize, i.e. recompose any decomposed
characters (NFC). I remember that Mac OS X used (and still uses?)
decomposed characters by default, so when you typed 'á' into your
CLI, it would automatically decompose it to 'a' + acute. D's
`string`, however, returns a length of 2 for the composed 'á' too,
because `.length` counts UTF-8 code units. If you do a lot of
string handling it will come back to bite you sooner or later.
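
As a rough illustration of the recomposition step, here is a minimal Python 3 sketch using the standard library's `unicodedata.normalize`. The last line encodes to UTF-8 to mimic what a code-unit count would report, on the assumption that this is what the `string` remark refers to. Variable names are illustrative.

```python
import unicodedata

decomposed = "a\u0301"  # 'a' + U+0301 COMBINING ACUTE ACCENT
recomposed = unicodedata.normalize("NFC", decomposed)  # compose where possible

print(len(decomposed))                  # 2 code points
print(len(recomposed))                  # 1 code point (U+00E1)
print(len(recomposed.encode("utf-8")))  # 2 UTF-8 code units (bytes)
```

Note that NFC only helps where a precomposed code point exists. Other combining sequences keep more than one code point after normalization.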