[power-pro] Re: unicode multi-word code points

entropyreduction Thu, 27 Aug 2009 21:41:16 -0700

Interesting.  Gives you your trial data for regex XYZ options.

I'll add > 0xFFFF => surrogate pair for the _from_num services when I get a 
chance.  Now off net until next week, probably Wednesday.
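
For reference, the > 0xFFFF => surrogate pair step is plain arithmetic on the
code point. A minimal sketch in Python (not PowerPro, and not the plugin's
actual code; the name to_surrogate_pair is made up for illustration):

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point must be in 0x10000..0x10FFFF")
    v = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 | (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))       # 0xd801 0xdc00
```

Code points at or below 0xFFFF need no pair and can be stored as a single
16-bit word, so only the upper range has to take this path.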


--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:

> --- In [email protected], "silvermoonwoman2001" <sherip99@> wrote:
> >
> > --- In [email protected], "entropyreduction" 
> > <alancampbelllists+yahoo@> wrote:
> > >
> > > Thinking out loud:
> > > 
> > > Problem of multibyte codepoints comes in two parts: combining character 
> > > sequences and surrogate pairs.
> > > 
> > > combining character sequences:
> > > 
> > >   can give a wrong byte count, and are one of the main ways 
> > >   you can end up with the same string represented several ways, 
> > >   so would cause comparisons to fail.  
> > > 
> > >   Normalisation form C would probably eliminate most combining 
> > >   character sequences.  I could silently normalise all unicode 
> > >   strings before storing them, providing I could locate either
> > >   normaliz.dll or icu.dll (both would have to be dynamically loaded,
> > >   because both might be absent).  
> > >    
> > >   But that means string stored might be different from string
> > >   user thinks they've stored.  Maybe not good.
> > > 
> > >   Or I could perhaps provide a normalise service, up to
> > >   user to apply it or not.
> > 
> > How about a compromise between the two,
> > unicode.normalization("On"), unicode.normalization("Off")
> > 
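
To make the combining-sequence point concrete, here is a small sketch using
Python's standard unicodedata module (an illustration only; the plugin would
go through normaliz.dll or icu.dll as discussed above). It shows two
spellings of the same text failing a comparison until both are put in form C,
and one sequence that form C cannot shorten:

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Same text to a reader, but different lengths and a failed comparison.
print(len(composed), len(decomposed))     # 1 2
print(composed == decomposed)             # False

# After normalising to form C the two spellings agree.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed, len(nfc))          # True 1

# No precomposed q-acute exists, so this sequence stays two code points.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))                       # 2
```

The last case is why form C removes most combining sequences rather than
all of them, whichever way the on/off question is settled.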
> > > 
> > > surrogate pairs:
> > > 
> > >   Entering them in number form not a serious problem.
> > > 
> > >   But they break current length service.  And 
> > >   get_char/set_char, maybe comparisons.  Not at all sure
> > >   what to do about them, except go whole hog and rewrite 
> > >   unicode plugin using icu as infrastructure.
> > >
> > 
> > I still think surrogate pairs would be a novelty that wouldn't
> > normally show up. It would be nice to be able to generate them
> > from their code points, but I think the primary usage would be
> > for creating test data. I wonder for example if the regex XYZ
> > case transformations could potentially corrupt unicode strings
> > that include code points above FFFF, but I need a meaningful
> > amount of test data to find out. 
> 
> With the help of this page <http://czyborra.com/utf/> I was able to make a 
> script that readily generates utf-8 strings directly from unicode code 
> points. Works well. Can call this from a for loop that appends the result to 
> a string of many, many utf-8 characters.
> 
> Function ReturnUTF8(c)
> ;c is a code point, e.g., 10400 or FFFF or whatever
> local u
> local point="U"++c
> c=eval("0x"++c)
> if (c < 0x80) do
>   u = ?"\x"++win.hex(c)
> elseif (c < 0x800)
>   u = ?"\x"++win.hex(0xC0 | c>>6)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> elseif (c < 0x10000)
>   u = ?"\x"++win.hex(0xE0 | c>>12)
>   u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> elseif (c < 0x200000)
>   u = ?"\x"++win.hex(0xF0 | c>>18)
>   u++=?"\x"++win.hex(0x80 | c>>12 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> endif
> win.debug(point++": "++u)
> quit(esc(u,?+\+))
> 
> Regards,
> Sheri
>
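
For cross-checking test data, the same bit layout can be written in Python
and compared against the built-in encoder. This is a sketch, not a
translation of the PowerPro internals; utf8_from_code_point is a made-up
name, and it takes the code point as a hex string the way ReturnUTF8 does.
One deliberate difference: it stops at 10FFFF, the last Unicode code point,
where the quoted script accepts anything below 200000.

```python
def utf8_from_code_point(hex_str):
    """Return the UTF-8 bytes for a code point given as a hex string."""
    c = int(hex_str, 16)
    if c < 0x80:                       # 1 byte:  0xxxxxxx
        out = [c]
    elif c < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
        out = [0xC0 | (c >> 6),
               0x80 | (c & 0x3F)]
    elif c < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out = [0xE0 | (c >> 12),
               0x80 | ((c >> 6) & 0x3F),
               0x80 | (c & 0x3F)]
    elif c <= 0x10FFFF:                # 4 bytes: 11110xxx then three 10xxxxxx
        out = [0xF0 | (c >> 18),
               0x80 | ((c >> 12) & 0x3F),
               0x80 | ((c >> 6) & 0x3F),
               0x80 | (c & 0x3F)]
    else:
        raise ValueError("beyond U+10FFFF")
    return bytes(out)


# Build a long test string from a range of code points above FFFF.
sample = b"".join(utf8_from_code_point(format(cp, "X"))
                  for cp in range(0x10400, 0x10450))
print(utf8_from_code_point("10400").hex())   # f0909080
print(len(sample))                           # 320
```

The shifts and masks are parenthesised explicitly so the result does not
depend on operator precedence.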
