On 2019-12-23 15:05:20 +0000, H. S. Teoh said:

On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via Digitalmars-d-learn wrote:
I want to add that I'm talking about Unicode strings.

Wouldn't it make sense to handle everything as UTF-32 so that
iteration is simple because code-point = code-unit?

And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful that code point != "character" the way most people understand
the word "character".

I know. My point was that in UTF-8, code-points (which are not characters) have different sizes, which you need to take into account if you want to iterate by code-points.

The word you're looking for is "grapheme", which, unfortunately, is rather complex and very slow to handle in
Unicode. See std.uni.byGrapheme.

Yes, that's when we come to "characters". And a "grapheme" can consist of several code-points. Is grapheme handling just slow in D or in general? If it's the latter, well, then that's just how it is.
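
To make the three levels concrete, here is a minimal D sketch that counts the same text as code units, code points, and graphemes. It uses std.uni.byGrapheme as mentioned above together with std.range.walkLength; the helper names countCodePoints and countGraphemes are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of code points (a string is decoded while it is walked).
size_t countCodePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: number of graphemes (user-perceived characters).
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, rendered as a single "é".
    string s = "e\u0301";

    assert(s.length == 3);            // UTF-8 code units
    assert(countCodePoints(s) == 2);  // code points
    assert(countGraphemes(s) == 1);   // graphemes

    // In UTF-32, code unit == code point, but the grapheme count does not change.
    dstring d = "e\u0301"d;
    assert(d.length == 2);
    assert(d.byGrapheme.walkLength == 1);
}
```

So UTF-32 only removes the code-unit vs. code-point difference; the grapheme boundaries still have to be computed.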

Usually you want to just stick with UTF-8 (usually) or UTF-16 (for
Windows and Java interop). UTF-32 wastes a lot of space, and *still*
doesn't give you what you think you want, and Grapheme[] is just dog
slow because of the amount of decoding/recoding needed to manipulate it.

I need to handle graphemes when things are going to be rendered and edited.

What are you planning to do with your strings?

Pretty simple: have user-editable content that is rendered using different fonts supporting Unicode.

So I need all the editing functions (insert, replace, delete) at any location in the string, supporting all Unicode characters.
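
A rough sketch of what such editing could look like on a plain UTF-8 string, assuming positions are given as grapheme indices and using std.uni.graphemeStride to find the boundaries. The names graphemeOffset, insertAt and removeAt are hypothetical, and this is only one possible approach, not a tuned implementation: each call scans from the start of the string.

```d
import std.uni : graphemeStride;

// Hypothetical helper: code-unit offset at which grapheme number `index` starts.
// Returns s.length when `index` is at or past the end of the string.
size_t graphemeOffset(string s, size_t index)
{
    size_t offset = 0;
    while (index > 0 && offset < s.length)
    {
        offset += graphemeStride(s, offset);
        --index;
    }
    return offset;
}

// Insert `text` in front of grapheme number `index`.
string insertAt(string s, size_t index, string text)
{
    immutable at = graphemeOffset(s, index);
    return s[0 .. at] ~ text ~ s[at .. $];
}

// Delete the whole grapheme number `index`, including its combining marks.
string removeAt(string s, size_t index)
{
    immutable start = graphemeOffset(s, index);
    if (start >= s.length)
        return s; // nothing to delete
    immutable end = start + graphemeStride(s, start);
    return s[0 .. start] ~ s[end .. $];
}

void main()
{
    string s = "e\u0301x"; // "é" (two code points) followed by "x"
    assert(insertAt(s, 1, "y") == "e\u0301yx");
    assert(removeAt(s, 0) == "x");
}
```

A replace would then be a removeAt followed by an insertAt at the same index. The storage stays UTF-8 throughout; only the boundary lookup works in graphemes.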

Best regards.

--
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
