On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via 
Digitalmars-d-learn wrote:
> Want to add I'm talking about unicode strings.
> 
> Wouldn't it make sense to handle everything as UTF-32 so that
> iteration is simple because code-point = code-unit?
> 
> And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful that code point != "character" the way most people understand
the word "character".  The word you're looking for is "grapheme".
Which, unfortunately, is rather complex and very slow to handle in
Unicode. See std.uni.byGrapheme.
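
To make the distinction concrete, here is a minimal sketch (the sample
string is my own choice, not from the original question) that counts the
same text three ways: code units, code points, and graphemes.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders as a single "é".
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units (1 + 2 bytes)
    assert(s.walkLength == 2);            // code points (strings decode to dchar)
    assert(s.byGrapheme.walkLength == 1); // what a reader calls one "character"
}
```

Only the last count matches what a person sees on screen, and it is the
one that needs the std.uni machinery.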

In most cases you want to just stick with UTF-8, or UTF-16 (for
Windows and Java interop). UTF-32 wastes a lot of space, and *still*
doesn't give you what you think you want, and Grapheme[] is just dog
slow because of the amount of decoding/recoding needed to manipulate it.
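
As a rough illustration of the space cost (a sketch only; the sample
strings are assumptions of mine), the same ASCII text takes four times as
much memory as a dstring, and a dstring still holds two elements for one
visible accented letter:

```d
import std.conv : to;

void main()
{
    string  s8  = "hello";
    wstring s16 = s8.to!wstring;
    dstring s32 = s8.to!dstring;

    // Bytes occupied by the character data in each encoding.
    assert(s8.length  * char.sizeof  == 5);
    assert(s16.length * wchar.sizeof == 10);
    assert(s32.length * dchar.sizeof == 20);

    // UTF-32 gives one element per code point, not per grapheme:
    // 'e' + combining acute accent is still two elements.
    dstring accented = "e\u0301"d;
    assert(accented.length == 2);
}
```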

What are you planning to do with your strings?  IME, using ~
occasionally doesn't add *too* much GC pressure, and slicing is usually
the idiomatic way of working with strings in D (it can result in faster
code than C because you don't have to keep strcpy()'d stuff all over the
place).  If you're appending strings a LOT, you might want to consider
using std.array.appender in your inner loops to alleviate some of the
cost of using ~ too much.  Or use lazy evaluation and ranges to defer
actually constructing the string until the end when it's ready to be
stored.
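
A minimal sketch of the three approaches mentioned above, assuming a
made-up task (joining numbers with commas); the helper names
joinWithAppender and joinLazily are hypothetical and only for
illustration:

```d
import std.algorithm : map;
import std.array : appender, join;
import std.conv : to;
import std.range : iota;

// Hypothetical helper: builds "0,1,2,..." with an appender instead of ~=.
string joinWithAppender(int n)
{
    auto buf = appender!string();
    foreach (i; 0 .. n)
    {
        if (i > 0)
            buf.put(',');
        buf.put(i.to!string);
    }
    return buf.data;
}

// Hypothetical helper: same result; the map is lazy and the final
// string is only built once, by join, at the end.
string joinLazily(int n)
{
    return iota(n).map!(i => i.to!string).join(",");
}

void main()
{
    assert(joinWithAppender(4) == "0,1,2,3");
    assert(joinLazily(4) == "0,1,2,3");

    // Slicing shares memory with the original; nothing is copied.
    string line = "key=value";
    string key = line[0 .. 3];
    assert(key == "key");
    assert(key.ptr == line.ptr);
}
```

Whether the appender or the lazy version is faster depends on the
workload, so it is worth measuring before committing to either.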

Still, this all depends on what you're trying to do with your strings.
Elaborate a bit more about your use case, and we might be able to give
better advice.


T

-- 
Nobody is perfect.  I am Nobody. -- pepoluan, GKC forum
