--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> Thinking out loud:
> 
> Problem of multibyte codepoints comes in two parts: combining character 
> sequences and surrogate pairs.
> 
> combining character sequences:
> 
>   can give a wrong byte count, and are one of the main ways 
>   you can end up with the same string represented several ways, 
>   so would cause comparisons to fail.  
> 
>   Normalisation form C would probably eliminate most combining 
>   character sequences.  I could silently normalise all unicode 
>   strings before storing them, providing I could locate either
>   normaliz.dll or icu.dll (both would have to be dynamically loaded,
>   because both might be absent).  
>    
>   But that means the stored string might differ from the 
>   string the user thinks they've stored.  Maybe not good.
> 
>   Or I could perhaps provide a normalise service, leaving it
>   up to the user to apply it or not.

How about a compromise between the two:
unicode.normalization("On") / unicode.normalization("Off")?
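To illustrate the idea, here is a minimal sketch in Python (standing in for the plugin's API; the `normalize_on_store` switch and `store` helper are made-up names, and `unicodedata` plays the role of normaliz.dll/icu.dll):

```python
import unicodedata  # stdlib stand-in for an NFC normaliser

normalize_on_store = True  # hypothetical "On"/"Off" switch

def store(s):
    # With the switch on, collapse combining character sequences
    # to Normalisation Form C before storing.
    return unicodedata.normalize("NFC", s) if normalize_on_store else s

# "é" as one precomposed code point vs. "e" + combining acute accent:
precomposed = "\u00e9"
combining = "e\u0301"
assert precomposed != combining                 # raw comparison fails
assert store(precomposed) == store(combining)  # NFC makes them equal
```

With the switch off, `store` would hand back the combining sequence untouched, preserving exactly what the user entered.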

> 
> surrogate pairs:
> 
>   Entering them in number form not a serious problem.
> 
>   But they break current length service.  And 
>   get_char/set_char, maybe comparisons.  Not at all sure
>   what to do about them, except go whole hog and rewrite 
>   unicode plugin using icu as infrastructure.
>

I still think surrogate pairs would be a novelty that wouldn't normally show 
up. It would be nice to be able to generate them from their code points, but I 
think the primary usage would be for creating test data. I wonder, for example, 
whether the regex XYZ case transformations could corrupt unicode strings 
that include code points above FFFF, but I need a meaningful amount of test data 
to find out.
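Generating such test data is straightforward; a small Python sketch (standard library only; the `utf16_units` helper is a made-up name for a naive 16-bit-unit length count) that builds a character above FFFF and recovers its surrogate pair:

```python
import struct

def utf16_units(s):
    # Number of UTF-16 code units, i.e. what a length service that
    # counts 16-bit units would report.
    return len(s.encode("utf-16-le")) // 2

test_char = chr(0x1D11E)   # MUSICAL SYMBOL G CLEF, above FFFF
assert len(test_char) == 1           # one code point...
assert utf16_units(test_char) == 2   # ...but two UTF-16 units

# The surrogate pair itself, for entering in number form:
hi, lo = struct.unpack("<HH", test_char.encode("utf-16-le"))
assert (hi, lo) == (0xD834, 0xDD1E)
```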

Regards,
Sheri
