Thinking out loud: the problem of characters that span more than one code point or code unit comes in two parts: combining character sequences and surrogate pairs.
Combining character sequences: these can give a misleading length (more code points than the characters the user sees), and they are one of the main ways you can end up with the same string represented several ways, which would cause comparisons to fail. Normalisation Form C would probably eliminate most combining character sequences. I could silently normalise all Unicode strings before storing them, provided I could locate either normaliz.dll or icu.dll (both would have to be loaded dynamically, because either might be absent). But that means the string stored might differ from the string the user thinks they've stored. Maybe not good. Or I could provide a normalise service and leave it up to the user to apply it or not.

Surrogate pairs: entering them in number form is not a serious problem. But they break the current length service, and get_char/set_char, and maybe comparisons. Not at all sure what to do about them, except go the whole hog and rewrite the Unicode plugin using ICU as infrastructure.
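To make both halves of the problem concrete, here is a rough sketch in Python, which only stands in for whatever the plugin is actually written in. The function names (normalise_nfc, utf16_units) are made up for illustration, and the assumption that the plugin counts UTF-16 code units is mine, inferred from the mention of surrogate pairs. Python's unicodedata module plays the role that normaliz.dll or ICU would play.

```python
import unicodedata


def normalise_nfc(s: str) -> str:
    """Hypothetical 'normalise service': return the NFC form of s."""
    return unicodedata.normalize("NFC", s)


def utf16_units(s: str) -> int:
    """Count UTF-16 code units; a surrogate pair counts as two.

    Assumption: this is what a naive length service over UTF-16 reports.
    """
    return len(s.encode("utf-16-le")) // 2


# Combining character sequences: same visible text, two representations.
composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(normalise_nfc(composed) == normalise_nfc(decomposed))  # True
print(len(decomposed), len(normalise_nfc(decomposed)))       # 2 1

# NFC only helps where a precomposed character exists; 'q' with an
# acute accent has none, so the sequence survives normalisation.
print(len(normalise_nfc("q\u0301")))                         # 2

# Surrogate pairs: one code point outside the BMP, two UTF-16 code units.
clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF
print(len(clef), utf16_units(clef))                          # 1 2
```

The q-with-acute case is why "most" rather than "all" is the right word: normalisation makes equivalent strings compare equal, but it does not make length equal the number of visible characters. The last line shows the mismatch that would trip up a length or get_char/set_char service indexing by code unit.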
