Interesting. Gives you your trial data for regex XYZ options. I'll add > 0xFFFF => surrogate pair for the _from_num services when I get a chance. Now off net until next week, probably Wednesday.
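
For reference, the > 0xFFFF => surrogate pair arithmetic is short. What follows is only a rough sketch in Python, not plugin code: the function name to_surrogate_pair is made up for illustration and says nothing about how the _from_num services will actually be written.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only; not part of any plugin API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogate pairs only exist for U+10000..U+10FFFF")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 | (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10400)
    print("U+10400 -> %04X %04X" % (high, low))
```

Code points at or below 0xFFFF need no pair, so a caller would only take this path for the larger values.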
--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
> --- In [email protected], "silvermoonwoman2001" <sherip99@> wrote:
> >
> > --- In [email protected], "entropyreduction"
> > <alancampbelllists+yahoo@> wrote:
> > >
> > > Thinking out loud:
> > >
> > > Problem of multibyte codepoints comes in two parts: combining character
> > > sequences and surrogate pairs.
> > >
> > > combining character sequences:
> > >
> > > can give a wrong byte count, and are one of the main ways
> > > you can end up with the same string represented several ways,
> > > so would cause comparisons to fail.
> > >
> > > Normalisation form C would probably eliminate most combining
> > > character sequences. I could silently normalise all unicode
> > > strings before storing them, providing I could locate either
> > > normaliz.dll or icu.dll (both would have to be dynamically loaded,
> > > because both might be absent).
> > >
> > > But that means string stored might be different from string
> > > user thinks they've stored. Maybe not good.
> > >
> > > Or I could perhaps provide a normalise service, up to
> > > user to apply it or not.
> >
> > How about a compromise between the two,
> > unicode.normalization("On"), unicode.normalization("Off")
> >
> > >
> > > surrogate pairs:
> > >
> > > Entering them in number form not a serious problem.
> > >
> > > But they break current length service. And
> > > get_char/set_char, maybe comparisons. Not at all sure
> > > what to do about them, except go whole hog and rewrite
> > > unicode plugin using icu as infrastructure.
> > >
> >
> > I still think surrogate pairs would be a novelty that wouldn't
> > normally show up. It would be nice to be able to generate them
> > from their code points, but I think the primary usage would be
> > for creating test data.
> > I wonder for example if the regex XYZ
> > case transformations could potentially corrupt unicode strings
> > that include code points above FFFF, but I need a meaningful
> > amount of test data to find out.
>
> With the help of this page <http://czyborra.com/utf/> I was able to make a
> script that readily generates utf-8 strings directly from unicode code
> points. Works well. Can call this from a for loop that appends the result to
> a string of many, many utf-8 characters.
>
> Function ReturnUTF8(c)
> ;c is a code point, e.g., 10400 or FFFF or whatever
> local u
> local point="U"++c
> c=eval("0x"++c)
> if (c < 0x80) do
>   u = ?"\x"++win.hex(c)
> elseif (c < 0x800)
>   u = ?"\x"++ win.hex (0xC0 | c>>6)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> elseif (c < 0x10000)
>   u= ?"\x"++win.hex(0xE0 | c>>12)
>   u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> elseif (c < 0x200000)
>   u = ?"\x"++win.hex(0xF0 | c>>18)
>   u++=?"\x"++win.hex(0x80 | c>>12 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c>>6 & 0x3F)
>   u++=?"\x"++win.hex(0x80 | c & 0x3F)
> endif
> win.debug(point++": "++u)
> quit(esc(u,?+\+))
>
> Regards,
> Sheri
>
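
For anyone who wants to cross-check test data generated by the quoted script, here is a rough Python sketch of the same branch structure. It is an illustration, not a port: the function name utf8_bytes is invented, it takes an integer rather than a hex string, and it assumes a ceiling of 0x10FFFF (the Unicode maximum) where the script accepts anything below 0x200000. Like the script, it does not reject the surrogate range D800..DFFF.

```python
def utf8_bytes(code_point):
    """Encode one code point as UTF-8, following the quoted script's branches.

    Hypothetical helper for illustration; takes an int, not a hex string.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    # Build a run of test characters, as the quoted message does in a for loop.
    sample = b"".join(utf8_bytes(cp) for cp in range(0x10400, 0x10410))
    print(utf8_bytes(int("10400", 16)).hex(), len(sample))
```

Comparing the result against Python's own chr(cp).encode("utf-8") for a handful of values is a quick way to confirm the byte patterns before feeding them to the regex tests.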

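On the combining character point quoted above, a small Python sketch shows why un-normalised strings fail comparison and give different lengths, and what form C does about it. It uses the standard library's unicodedata module purely as a stand-in; it assumes nothing about how normaliz.dll or icu.dll would be called from the plugin.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT, versus the precomposed character.
decomposed = "e\u0301"
precomposed = "\u00e9"

# Same visible text, but different code point sequences and lengths.
same_before = decomposed == precomposed
lengths = (len(decomposed), len(precomposed))

# Normalisation form C composes the sequence where a precomposed form exists.
normalised = unicodedata.normalize("NFC", decomposed)
same_after = normalised == precomposed

if __name__ == "__main__":
    print(same_before, lengths, same_after)
```

The normalised string is not the string that was passed in, which is the "string stored might be different" concern raised in the quoted message.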