--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
>
> --- In [email protected], "entropyreduction"
> <alancampbelllists+yahoo@> wrote:
> >
> > Thinking out loud:
> >
> > Problem of multibyte codepoints comes in two parts: combining character
> > sequences and surrogate pairs.
> >
> > combining character sequences:
> >
> > can give a wrong byte count, and are one of the main ways
> > you can end up with the same string represented several ways,
> > so would cause comparisons to fail.
> >
> > Normalisation form C would probably eliminate most combining
> > character sequences. I could silently normalise all unicode
> > strings before storing them, providing I could locate either
> > normaliz.dll or icu.dll (both would have to be dynamically loaded,
> > because both might be absent).
> >
> > But that means string stored might be different from string
> > user thinks they've stored. Maybe not good.
> >
> > Or I could perhaps provide a normalise service, up to
> > user to apply it or not.
>
> How about a compromise between the two,
> unicode.normalization("On"), unicode.normalization("Off")
>
> >
> > surrogate pairs:
> >
> > Entering them in number form not a serious problem.
> >
> > But they break current length service. And
> > get_char/set_char, maybe comparisons. Not at all sure
> > what to do about them, except go whole hog and rewrite
> > unicode plugin using icu as infrastructure.
> >
>
> I still think surrogate pairs would be a novelty that wouldn't
> normally show up. It would be nice to be able to generate them
> from their code points, but I think the primary usage would be
> for creating test data. I wonder for example if the regex XYZ
> case transformations could potentially corrupt unicode strings
> that include code points above FFFF, but I need a meaningful
> amount of test data to find out.
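
For anyone who wants to see the two problems quoted above in isolation, here is a rough sketch in Python rather than PowerPro script. It leans on Python's standard unicodedata module instead of normaliz.dll or icu.dll, and the helper names nfc and utf16_units are made up for this illustration; nothing here is part of the unicode plugin.

```python
import unicodedata

# The same visible text, stored as two different code point sequences.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def nfc(s):
    """Return s in Normalisation Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)


def utf16_units(s):
    """Count UTF-16 code units; a code point above FFFF takes two (a surrogate pair)."""
    return len(s.encode("utf-16-le")) // 2


# Without normalisation the two spellings compare unequal.
raw_equal = precomposed == combining

# After normalising both to form C they compare equal.
nfc_equal = nfc(precomposed) == nfc(combining)

# UTF-8 byte counts differ between the two spellings: 2 bytes versus 3.
byte_counts = (len(precomposed.encode("utf-8")), len(combining.encode("utf-8")))

# U+10400 is a single code point but two UTF-16 code units.
deseret_units = utf16_units("\U00010400")

print(raw_equal, nfc_equal, byte_counts, deseret_units)
```

The last value is the reason a length service that counts 16-bit units reports 2 for a single character above FFFF.
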
With the help of this page <http://czyborra.com/utf/> I was able to make a script that readily generates utf-8 strings directly from unicode code points. Works well. Can call this from a for loop that appends the result to a string of many, many utf-8 characters.

Function ReturnUTF8(c)
;c is a code point, e.g., 10400 or FFFF or whatever
local u
local point="U"++c
c=eval("0x"++c)
if (c < 0x80) do
    u = ?"\x"++win.hex(c)
elseif (c < 0x800)
    u = ?"\x"++ win.hex (0xC0 | c>>6)
    u++=?"\x"++win.hex(0x80 | c & 0x3F)
elseif (c < 0x10000)
    u= ?"\x"++win.hex(0xE0 | c>>12)
    u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
    u++=?"\x"++win.hex(0x80 | c & 0x3F)
elseif (c < 0x200000)
    u = ?"\x"++win.hex(0xF0 | c>>18)
    u++=?"\x"++win.hex(0x80 | c>>12 & 0x3F)
    u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
    u++=?"\x"++win.hex(0x80 | c & 0x3F)
endif
win.debug(point++": "++u)
quit(esc(u,?+\+))

Regards,
Sheri
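
P.S. For cross-checking the output outside PowerPro, the same branching can be sketched in Python. This is only an illustration under a few assumptions: the name utf8_from_codepoint is invented here, the function returns raw bytes instead of an escaped string, and the last branch stops at 10FFFF (the top of the Unicode range) where the script above tests against 0x200000. Like the script, it does not reject the surrogate range D800 to DFFF.

```python
def utf8_from_codepoint(hex_cp):
    """Encode one code point, given as a hex string such as "10400", to UTF-8 bytes."""
    c = int(hex_cp, 16)
    if c < 0x80:
        # One byte: plain ASCII.
        return bytes([c])
    if c < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c < 0x110000:
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (c >> 18),
                      0x80 | ((c >> 12) & 0x3F),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    raise ValueError("code point out of range: " + hex_cp)


# Build a run of test data from a for loop, here U+10400 through U+1044F.
test_data = b"".join(
    utf8_from_codepoint(format(cp, "X")) for cp in range(0x10400, 0x10450)
)

# Compare against Python's built-in encoder for a few sample code points.
for sample in ("41", "E9", "FFFF", "10400"):
    assert utf8_from_codepoint(sample) == chr(int(sample, 16)).encode("utf-8")
```

The explicit parentheses around each shift and mask are there because operator precedence differs between languages; they spell out the grouping the UTF-8 bit layout needs.
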
