--- In [email protected], "entropyreduction"
<alancampbelllists+ya...@...> wrote:
>
> Thinking out loud:
>
> Problem of multibyte codepoints comes in two parts: combining character
> sequences and surrogate pairs.
>
> combining character sequences:
>
> can give a wrong byte count, and are one of the main ways
> you can end up with the same string represented several ways,
> so would cause comparisons to fail.
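For example, here is that comparison failure in Python (standing in for whatever string services the plugin exposes): the precomposed and combining-sequence forms of the same visible character compare unequal and report different lengths.

```python
# Two spellings of "é": precomposed U+00E9 vs. "e" + combining acute (U+0301).
precomposed = "\u00e9"
combining = "e\u0301"

# Visually identical text, but the raw comparison fails...
print(precomposed == combining)            # False

# ...and the code-point and byte counts disagree.
print(len(precomposed))                    # 1
print(len(combining))                      # 2
print(len(precomposed.encode("utf-8")))    # 2
print(len(combining.encode("utf-8")))      # 3
```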
>
> Normalisation form C would probably eliminate most combining
> character sequences. I could silently normalise all unicode
> strings before storing them, providing I could locate either
> normaliz.dll or icu.dll (both would have to be dynamically loaded,
> because both might be absent).
>
> But that means string stored might be different from string
> user thinks they've stored. Maybe not good.
>
> Or I could perhaps provide a normalise service, up to
> user to apply it or not.
How about a compromise between the two:
unicode.normalization("On") and unicode.normalization("Off")?
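A sketch of that toggle in Python (all names here are hypothetical, and unicodedata's NFC stands in for normaliz.dll/icu.dll): normalisation is applied on store only when the user has switched it on, so by default the stored string is exactly what they supplied.

```python
import unicodedata

_normalize_on = False  # hypothetical module-level switch

def normalization(mode):
    """The unicode.normalization("On"/"Off") service suggested above."""
    global _normalize_on
    _normalize_on = (mode == "On")

def store_string(s):
    """Apply normalisation form C on store only if the user opted in."""
    return unicodedata.normalize("NFC", s) if _normalize_on else s

normalization("Off")
print(store_string("e\u0301") == "e\u0301")   # True: stored untouched
normalization("On")
print(store_string("e\u0301") == "\u00e9")    # True: silently normalised
```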
>
> surrogate pairs:
>
> Entering them in number form not a serious problem.
>
> But they break current length service. And
> get_char/set_char, maybe comparisons. Not at all sure
> what to do about them, except go whole hog and rewrite
> unicode plugin using icu as infrastructure.
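To illustrate the breakage (a Python sketch; assuming the plugin stores strings as UTF-16, counting 16-bit units models its current length service): a code point above FFFF is one character but two UTF-16 units, so a unit-based length, get_char or set_char lands in the middle of the pair.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies above FFFF, so UTF-16 needs a surrogate pair.
clef = "\U0001D11E"

# One character, one code point...
print(len(clef))                              # 1

# ...but two 16-bit units once encoded, which is what a UTF-16-based
# length/get_char implementation would count and index by.
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)                            # 2
```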
>
I still think surrogate pairs would be a novelty that wouldn't normally show
up. It would be nice to be able to generate them from their code points, but I
think the primary usage would be for creating test data. I wonder for example
if the regex XYZ case transformations could potentially corrupt unicode strings
that include code points above FFFF, but I need a meaningful amount of test data
to find out.
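Generating a pair from its code point is simple arithmetic, so producing that test data is easy enough; a Python sketch (function name is mine):

```python
def to_surrogate_pair(cp):
    """Split a code point above FFFF into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20-bit value to distribute
    high = 0xD800 + (v >> 10)        # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate: bottom 10 bits
    return high, low

hi, lo = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(hex(hi), hex(lo))              # 0xd834 0xdd1e
```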
Regards,
Sheri