--- In [email protected], "Sheri" <sheri...@...> wrote:
>
> > > First, the unicode.dll text case routines, must not be handling the
> > > exceptions.
> > I use API functions CharUpperW, CharLowerW: whatever they spew
> > out, is what you get. If you can find any other simple-ish way to
> > case convert unicode, let me know.
>
> However it is doing it is ok with me.
I'll update docs to mention exceptions
> > > Second, the unicode plugin is not handling all the possible numeric
> > > values given in the unicode tables. It apparently should go much higher
> > > than 65280.
> >
> > > For example:
> > > local test=unicode.from_num(0x10400)
> > > ;;error, but should be DESERET CAPITAL LETTER LONG I
> > > unicode.messagebox("ok", test)
> >
> > Unicode characters in windows, AFAIK, are defined as something that fits in
> > a wide character.
> >
> > WCHAR == unsigned short int == 2 bytes == max 65,535
>
> You should change the doc for unicode.dll (it says 65280)
will do
> I learned that Unicode characters up to 0xFFFF are called "bmp characters".
> Bmp is short for "basic multilingual plane". But looking at UnicodeData.txt,
> there are more than 4600 additional characters defined beyond that in Unicode
> 5.1, plus a couple of reserved ranges. Only about 80 of them have a concept
> of text case.
>
> I confess that I don't really understand it, but Windows supposedly uses
> "surrogate pairs" for (some?) non-bmp data.
>
> <http://en.wikipedia.org/wiki/UTF-16/UCS-2>
Nice article.
http://unicode.org/iuc/iuc18/papers/a8.ppt
suggests a fairly straightforward way to convert non-bmp characters.
Who knew?
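
For the archive, here is a rough sketch of the surrogate-pair arithmetic as I understand it from the UTF-16 article. It is written in Python purely for illustration, it is not the plugin's code, and the function names are made up:

```python
def to_surrogate_pair(code_point):
    """Split a non-bmp code point (0x10000..0x10FFFF) into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-bmp code point: %#x" % code_point)
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low (trail) surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Sheri's example, DESERET CAPITAL LETTER LONG I
high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))
```

So a value like 0x10400 would have to be stored as two wide characters rather than one, which is why a single WCHAR cannot hold it.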
...snip....
> > That's useful: worst case (pun sorry) is converted string could
> > be three times as long as original (e.g. if all chars 006B and
> > want to become 212A), so if I pad out the buffer I use to
> > accumulate replacement pattern results accordingly, should be
> > safe.
>
> That one is not reversible. If upcasing a lower case "k" it will be "K", not
> K (Capital Kelvin). So double, at worst.
Okay, thanks
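
A quick way to see the one-way mapping, sketched in Python rather than with CharUpperW/CharLowerW. It reflects Python's Unicode tables, which I am only assuming agree with Windows on this character, and the helper names are made up:

```python
KELVIN = "\u212a"  # KELVIN SIGN, the 212A mentioned above


def round_trips(ch):
    """True if lowercasing and then uppercasing gives back the same character."""
    return ch.lower().upper() == ch


def utf8_len(ch):
    """Number of bytes the character occupies when encoded as utf8."""
    return len(ch.encode("utf-8"))


print(KELVIN.lower() == "k")       # Kelvin sign lowercases to plain k
print("k".upper() == KELVIN)       # plain k does not uppercase to the Kelvin sign
print(round_trips(KELVIN))         # so the conversion is not reversible
print(utf8_len("k"), utf8_len(KELVIN))
```

So the growth from 006B to 212A can only happen if a conversion maps in that direction, which upcasing "k" does not.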
> Another thing I learned is that if you convert the bytes of utf8 characters
> to binary, you can determine for each byte from its bits whether the byte is
> a single-byte ascii character, or a leading byte or continuation byte of a
> multibyte character. If it is a leading byte, how many bytes are in the
> character.
> <http://en.wikipedia.org/wiki/UTF-8>
Could be useful. Ta.
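
As I read the UTF-8 article, the bit test would look something like the sketch below. It is Python for illustration only, the function name and labels are made up, and it only looks at the bit pattern of each byte, so it does not reject overlong or otherwise invalid sequences:

```python
def classify_utf8_byte(byte):
    """Return (kind, length) for one byte of utf8 data.

    length is the total number of bytes in the character when the byte
    starts one, and 0 for continuation or invalid bytes.
    """
    if (byte & 0x80) == 0x00:      # 0xxxxxxx
        return ("ascii", 1)
    if (byte & 0xC0) == 0x80:      # 10xxxxxx
        return ("continuation", 0)
    if (byte & 0xE0) == 0xC0:      # 110xxxxx
        return ("lead", 2)
    if (byte & 0xF0) == 0xE0:      # 1110xxxx
        return ("lead", 3)
    if (byte & 0xF8) == 0xF0:      # 11110xxx
        return ("lead", 4)
    return ("invalid", 0)          # 11111xxx never appears in utf8


# "k" followed by the Kelvin sign (212A)
for b in "k\u212a".encode("utf-8"):
    print(hex(b), classify_utf8_byte(b))
```

That would let a routine step through a buffer one whole character at a time without decoding it first.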