--- In [email protected], "Sheri" <sheri...@...> wrote:
>
> I did some evaluation of the CaseFolding.txt file.
>
> First, the unicode.dll text case routines must not be handling the
> exceptions.
>
> If they were, this:
>
> local ustring=unicode.new("Maße")
> unicode.messagebox("OK", ustring.change_case("upper"), "upper case")
>
> would display "MASSE"
>
> Instead we get "MAßE"
>
> I think this is preferable anyway for the regex bit.
I use the API functions CharUpperW and CharLowerW: whatever they spew out is what you
get. If you can find any other simple-ish way to case convert unicode, let me know.
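
For comparison only, here is a minimal sketch in Python (not PowerPro script, and not what unicode.dll does). Python's built-in str.upper() applies the full Unicode case mappings, so it does turn ß into SS, and the string gets longer as a result. The helper name is made up for illustration.

```python
def upper_with_growth(text):
    """Return the uppercased text and how many characters it gained."""
    upper = text.upper()  # full mapping: one character may become several
    return upper, len(upper) - len(text)


result, growth = upper_with_growth("Maße")
print(result, growth)
```

The length change is the thing to watch: a one-to-one conversion such as the one Sheri observed ("MAßE") never changes the character count, while a full mapping can.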
> Second, the unicode plugin is not handling all the possible numeric values
> given in the unicode tables. It apparently should go much higher than 65280.
> For example:
> local test=unicode.from_num(0x10400)
> ;;error, but should be DESERET CAPITAL LETTER LONG I
> unicode.messagebox("ok", test)
Unicode characters in Windows, AFAIK, are defined as something that fits in a
wide character.
WCHAR == unsigned short int == 2 bytes == max 65,535
Goodness knows what microsnot API functions make of 0x10400. There is no way to fit it
in a single WCHAR.
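
For reference, UTF-16 represents code points above 0xFFFF as two 16-bit units (a surrogate pair), so 0x10400 needs two WCHARs rather than one. A minimal sketch of that arithmetic in Python follows; the function name is hypothetical, and whether unicode.from_num could be extended this way is an assumption, not something I have checked.

```python
def utf16_units(code_point):
    """Split a Unicode code point into its UTF-16 code units (1 or 2 values)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000  # 20 bits to spread over two units
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_units(0x10400)])
```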
> Finally, FWIW, even using only the "C" and "S" status types in
> CaseFolding.txt (the non-exceptions), there are a few entries where the byte
> size seems to vary between the upper and lower case variants of the same
> letter expressed in utf-8:
>
> unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
> unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
> unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
> unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
> unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
> unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
> unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
> unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
> unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
> unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
> unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
> unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
> unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
That's useful: worst case (pun, sorry) is that the converted string could be three times
as long as the original (e.g. if every char is 006B and has to become 212A), so if I
pad out the buffer I use to accumulate replacement pattern results accordingly,
I should be safe.
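
A rough sketch of that sizing rule in Python, using only the code point pairs from Sheri's list. The function names are made up, and the factor is only as good as the list: it assumes no other "C" or "S" entry grows by more than these do.

```python
import math

# (code point, case-folded code point) pairs from the list quoted above
PAIRS = [
    (0x017F, 0x0073), (0x023A, 0x2C65), (0x023E, 0x2C66), (0x1E9E, 0x00DF),
    (0x1FBE, 0x03B9), (0x2126, 0x03C9), (0x212A, 0x006B), (0x212B, 0x00E5),
    (0x2C62, 0x026B), (0x2C64, 0x027D), (0x2C6D, 0x0251), (0x2C6E, 0x0271),
    (0x2C6F, 0x0250),
]


def utf8_len(code_point):
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def worst_case_growth(pairs):
    """Largest byte-length ratio between the two members of any pair, either direction."""
    worst = 1.0
    for a, b in pairs:
        len_a, len_b = utf8_len(a), utf8_len(b)
        worst = max(worst, len_a / len_b, len_b / len_a)
    return worst


def padded_buffer_size(source_bytes, pairs=PAIRS):
    """Bytes to reserve so a case-converted copy cannot overflow."""
    return math.ceil(source_bytes * worst_case_growth(pairs))


print(worst_case_growth(PAIRS), padded_buffer_size(10))
```

The ratio is taken in both directions because the conversion may go either way (006B to 212A grows, 212A to 006B shrinks).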