On 2019-12-23 15:05:20 +0000, H. S. Teoh said:
On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via
I want to add that I'm talking about Unicode strings.
Wouldn't it make sense to handle everything as UTF-32 so that
iteration is simple because code-point = code-unit?
And later on, convert to UTF-16 or UTF-8 on demand?
Be careful that code point != "character" the way most people understand
the word "character".
I know. My point was that in UTF-8, code points (which are not
characters) have different sizes, which you need to take into account if
you want to iterate by code point.
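To make the size difference concrete, here is a minimal D sketch. The helper name codePointOffsets is made up for illustration; it assumes valid UTF-8 input and uses std.utf.stride to step from one code point to the next.

```d
import std.stdio : writeln;
import std.utf : stride;

/// Byte offsets at which each code point of a UTF-8 string starts.
/// (Hypothetical helper, assumes `s` is valid UTF-8.)
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // stride(s, i) is the number of code units in the sequence starting at i.
    for (size_t i = 0; i < s.length; i += stride(s, i))
        offsets ~= i;
    return offsets;
}

void main()
{
    // 1-, 2-, 3- and 4-byte sequences: 'a', U+00E4, U+20AC, U+1F600
    string s = "a\u00E4\u20AC\U0001F600";

    writeln(s.length);            // 10 (UTF-8 code units)
    writeln(codePointOffsets(s)); // [0, 1, 3, 6]

    // foreach with a dchar loop variable decodes one code point per step.
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    writeln(n);                   // 4 (code points)
}
```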
The word you're looking for is "grapheme". Which, unfortunately, is
rather complex and very slow to handle in
Unicode. See std.uni.byGrapheme.
Yes, that's when we come to "characters". And a "grapheme" can consist
of several code points. Is grapheme handling just slow in D or in
general? If it's the latter, well, then that's just how it is.
You usually want to just stick with UTF-8 (the common case) or UTF-16 (for
Windows and Java interop). UTF-32 wastes a lot of space, and *still*
doesn't give you what you think you want, and Grapheme is just dog
slow because of the amount of decoding/recoding needed to manipulate it.
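A small sketch of why UTF-32 alone does not get you "characters": the same text has three different lengths depending on whether you count code units, code points, or graphemes. The helper graphemeCount is a made-up name; byGrapheme is the std.uni function mentioned above.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Number of grapheme clusters in `s` (hypothetical helper).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS renders as one character.
    string s = "noe\u0308l";
    dstring d = s.to!dstring;

    writeln(s.length);         // 6 (UTF-8 code units)
    writeln(d.length);         // 5 (code points, even in UTF-32)
    writeln(graphemeCount(s)); // 4 (what a user would call characters)
}
```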
I need to handle graphemes when things are going to be rendered and edited.
What are you planning to do with your strings?
Pretty simple: I have user-editable content that is rendered using
different fonts supporting Unicode.
So I need all the editing functions (insert, replace, delete) at any
location in the string, supporting all Unicode characters.
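One possible way to sketch such editing functions in D is to keep the text as UTF-8 and translate a grapheme index into a byte offset with std.uni.graphemeStride before slicing. The names graphemeOffset, insertAt and removeAt are made up for illustration, and the sketch assumes valid UTF-8. It rescans from the start on every edit, so it is not tuned for large documents.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

/// Byte offset where grapheme number `pos` starts (hypothetical helper).
/// Returns s.length if `pos` is at or past the end.
size_t graphemeOffset(string s, size_t pos)
{
    size_t offset = 0;
    while (pos > 0 && offset < s.length)
    {
        // graphemeStride gives the cluster length in code units.
        offset += graphemeStride(s, offset);
        --pos;
    }
    return offset;
}

/// Insert `text` in front of grapheme number `pos`.
string insertAt(string s, size_t pos, string text)
{
    immutable at = graphemeOffset(s, pos);
    return s[0 .. at] ~ text ~ s[at .. $];
}

/// Remove grapheme number `pos`; returns `s` unchanged if out of range.
string removeAt(string s, size_t pos)
{
    immutable start = graphemeOffset(s, pos);
    if (start >= s.length)
        return s;
    immutable end = start + graphemeStride(s, start);
    return s[0 .. start] ~ s[end .. $];
}

void main()
{
    string s = "noe\u0308l";            // 4 graphemes, 6 code units
    writeln(removeAt(s, 2) == "nol");   // base letter and mark go together
    writeln(insertAt(s, 3, "X") == "noe\u0308Xl");
}
```

Replace can be built as removeAt followed by insertAt at the same index. Note that inserted text starting with a combining mark will attach to the preceding grapheme, so an editor would likely need to re-segment around the edit point.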
Robert M. Münch
smarter | better | faster