--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
>
> --- In [email protected], "entropyreduction" 
> <alancampbelllists+yahoo@> wrote:
> >
> > Thinking out loud:
> > 
> > Problem of multibyte codepoints comes in two parts: combining character 
> > sequences and surrogate pairs.
> > 
> > combining character sequences:
> > 
> >   can give a wrong byte count, and are one of the main ways 
> >   you can end up with the same string represented several ways, 
> >   so would cause comparisons to fail.  
> > 
> >   Normalisation form C would probably eliminate most combining 
> >   character sequences.  I could silently normalise all unicode 
> >   strings before storing them, providing I could locate either
> >   normaliz.dll or icu.dll (both would have to be dynamically loaded,
> >   because both might be absent).  
> >    
> >   But that means string stored might be different from string
> >   user thinks they've stored.  Maybe not good.
> > 
> >   Or I could perhaps provide a normalise service, up to
> >   user to apply it or not.
> 
> How about a compromise between the two,
> unicode.normalization("On"), unicode.normalization("Off")
> 
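To make the on/off idea concrete, here is a rough sketch in Python rather than PowerPro, since Python ships NFC normalisation in its standard unicodedata module. The names prepare and utf8_len are made up for illustration and are not part of the unicode plugin. It shows the same visible character stored two ways, with different byte counts, and how an optional normalise switch makes them compare equal.

```python
import unicodedata

composed = "\u00e9"      # e-acute as one precomposed code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

def prepare(s, normalise):
    """Hypothetical on/off switch: apply NFC only when the caller asks for it."""
    return unicodedata.normalize("NFC", s) if normalise else s

# Same visible text, different code points and different byte counts.
print(composed == decomposed)                      # False
print(utf8_len(composed), utf8_len(decomposed))    # 2 3

# With the switch on, both forms end up identical.
print(prepare(composed, True) == prepare(decomposed, True))   # True

# NFC cannot compose everything: q + acute has no precomposed form.
print(len(unicodedata.normalize("NFC", "q\u0301")))           # 2
```

With the switch off, the string stored is exactly what the user supplied, which addresses the worry that the stored string could differ from what the user thinks they stored.
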
> > 
> > surrogate pairs:
> > 
> >   Entering them in number form not a serious problem.
> > 
> >   But they break current length service.  And 
> >   get_char/set_char, maybe comparisons.  Not at all sure
> >   what to do about them, except go whole hog and rewrite 
> >   unicode plugin using icu as infrastructure.
> >
> 
> I still think surrogate pairs would be a novelty that wouldn't
> normally show up. It would be nice to be able to generate them
> from their code points, but I think the primary usage would be
> for creating test data. I wonder for example if the regex XYZ
> case transformations could potentially corrupt unicode strings
> that include code points above FFFF, but I need a meaningful
> amount of test data to find out. 
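
For reference, generating a surrogate pair from a code point is only a little arithmetic. The sketch below is in Python, not PowerPro, and the function names surrogate_pair and utf16_units are my own, chosen for illustration. It also shows why a length service that counts 16-bit units reports 2 for a single character above FFFF.

```python
def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    v = cp - 0x10000                     # 20 bits remain after removing the offset
    high = 0xD800 | (v >> 10)            # top 10 bits
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits
    return high, low

def utf16_units(s):
    """Count 16-bit code units, which is what a UTF-16 based length would report."""
    return len(s.encode("utf-16-le")) // 2

high, low = surrogate_pair(0x10400)
print(hex(high), hex(low))               # 0xd801 0xdc00

ch = chr(0x10400)
print(len(ch), utf16_units(ch))          # 1 2

# A run of test data above FFFF, built straight from code points.
test_data = "".join(chr(cp) for cp in range(0x10400, 0x10450))
print(len(test_data), utf16_units(test_data))   # 80 160
```
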

With the help of this page <http://czyborra.com/utf/> I was able to make a 
script that readily generates utf-8 strings directly from unicode code points. 
Works well. Can call this from a for loop that appends the result to a string 
of many, many utf-8 characters.

Function ReturnUTF8(c)
;c is a code point given as a hex string, e.g., 10400 or FFFF or whatever
local u
local point="U"++c
c=eval("0x"++c)
if (c < 0x80) do
  u = ?"\x"++win.hex(c)
elseif (c < 0x800)
  u = ?"\x"++ win.hex (0xC0 | c>>6)
  u++=?"\x"++win.hex(0x80 | c & 0x3F)
elseif (c < 0x10000)
  u= ?"\x"++win.hex(0xE0 | c>>12)
  u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
  u++=?"\x"++win.hex(0x80 | c & 0x3F)
elseif (c < 0x110000)
  u = ?"\x"++win.hex(0xF0 | c>>18)
  u++=?"\x"++win.hex(0x80 | c>>12 & 0x3F)
  u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
  u++=?"\x"++win.hex(0x80 | c & 0x3F)
endif
win.debug(point++": "++u)
quit(esc(u,?+\+))
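
For anyone who wants to cross-check the output, here is the same bit layout as a Python sketch (not PowerPro). The names encode_utf8 and utf8_from_hex are made up for this example. It takes the code point as a hex string like the script above, stops at 10FFFF (the top of the Unicode range), and can be compared against Python's built-in encoder.

```python
def encode_utf8(cp):
    """Encode one code point as UTF-8 bytes, using the same shifts and masks as ReturnUTF8."""
    if cp < 0:
        raise ValueError("code point must not be negative")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point is above U+10FFFF")

def utf8_from_hex(hex_point):
    """Accept the code point as a hex string such as '10400' or 'FFFF'."""
    return encode_utf8(int(hex_point, 16))

print(utf8_from_hex("10400").hex())   # f0909080
print(utf8_from_hex("FFFF").hex())    # efbfbf

# Cross-check a few values against the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x10400, 0x10FFFF):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```

Note that the built-in encoder refuses the surrogate range D800 to DFFF, whereas this sketch (like the script) encodes those values without complaint, so leave them out of any comparison loop.
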

Regards,
Sheri
