On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via Digitalmars-d-learn wrote:
> Want to add I'm talking about unicode strings.
>
> Wouldn't it make sense to handle everything as UTF-32 so that
> iteration is simple because code-point = code-unit?
>
> And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful: code point != "character" the way most people understand the word "character". The word you're looking for is "grapheme", which unfortunately is rather complex and very slow to handle in Unicode. See std.uni.byGrapheme.

Usually you want to just stick with UTF-8, or UTF-16 for Windows and Java interop. UTF-32 wastes a lot of space and *still* doesn't give you what you think you want: each array element is one code point, not one grapheme. And Grapheme[] is just dog slow because of the amount of decoding/recoding needed to manipulate it.

What are you planning to do with your strings? IME, using ~ occasionally doesn't add *too* much GC pressure, and slicing is usually the idiomatic way of working with strings in D (it can result in faster code than C because you don't have to keep strcpy()'d copies all over the place).

If you're appending strings a LOT, you might want to consider using std.array.appender in your inner loops to alleviate some of the cost of using ~ too much. Or use lazy evaluation and ranges to defer actually constructing the string until the end, when it's ready to be stored.

Still, this all depends on what you're trying to do with your strings. Elaborate a bit more about your use case, and we might be able to give better advice.


T

--
Nobody is perfect. I am Nobody. -- pepoluan, GKC forum
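
P.S. To make the code unit / code point / grapheme distinction concrete, here is a minimal sketch, assuming a reasonably recent Phobos. The helper names graphemeCount and joinPieces are made up for illustration and are not library functions; only byGrapheme, walkLength and appender come from Phobos. It also shows appender as a stand-in for repeated ~ in a loop.

```d
import std.array : appender;
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: concatenates pieces through an Appender
/// instead of using ~ on every iteration.
string joinPieces(string[] pieces)
{
    auto buf = appender!string();
    foreach (p; pieces)
        buf.put(p);
    return buf.data;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    string s = "e\u0301";
    dstring d = "e\u0301"d;

    assert(s.length == 3);          // UTF-8 code units
    assert(s.walkLength == 2);      // code points (narrow strings auto-decode)
    assert(d.length == 2);          // UTF-32: still two elements, not one
    assert(graphemeCount(s) == 1);  // what most people would call one "character"

    // Slicing creates a view into the same memory; nothing is copied.
    string head = s[0 .. 1];
    assert(head == "e");

    assert(joinPieces(["UTF", "-", "8"]) == "UTF-8");
}
```

Note that the dstring still has length 2 for a single visible character, so going to UTF-32 only buys you code point indexing, at four bytes per element.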