--- In [email protected], "Sheri" <sheri...@...> wrote:
>
> I did some evaluation of the CaseFolding.txt file.
> 
> First, the unicode.dll text case routines must not be handling the 
> exceptions.
> 
> If it were, this:
> 
> local ustring=unicode.new("Maße")
> unicode.messagebox("OK", ustring.change_case("upper"), "upper case")
> 
> would display "MASSE"
> 
> Instead we get "MAßE"
> 
> I think this is preferable anyway for the regex bit.

I use the API functions CharUpperW and CharLowerW: whatever they spew out is 
what you get.  If you can find any other simple-ish way to case-convert 
Unicode, let me know.
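
To make the "MASSE" vs "MAßE" difference concrete, here is a rough sketch in 
Python (not the plugin's language, and not what the API functions do 
internally). It assumes the API conversion is a one-to-one character mapping, 
which is what the "MAßE" result suggests. simple_upper is a hypothetical 
helper that only imitates that behaviour:

```python
def simple_upper(text: str) -> str:
    """Upper-case one character at a time, leaving alone any character
    whose upper-case form would expand to more than one character."""
    out = []
    for ch in text:
        up = ch.upper()
        # Keep the original character when the mapping is not one-to-one.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


word = "Maße"
print(word.upper())        # full case mapping: the sharp s expands to SS
print(simple_upper(word))  # one-to-one mapping: the sharp s is left as-is
```

The one-to-one version never changes the number of characters, which is 
probably the friendlier behaviour for the regex work anyway.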

> Second, the unicode plugin is not handling all the possible numeric values 
> given in the unicode tables. It apparently should go much higher than 65280.
 
> For example:
> local test=unicode.from_num(0x10400)
> ;;error, but should be DESERET CAPITAL LETTER LONG I
> unicode.messagebox("ok", test)

Unicode characters in Windows, AFAIK, are defined as something that fits in a 
wide character. 

WCHAR == unsigned short int == 2 bytes == max 65,535

Goodness knows what microsnot api functions make of 0x10400.  No way to fit it 
in a WCHAR.
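
For reference, UTF-16 represents code points above 0xFFFF as a pair of 16-bit 
units (a surrogate pair), so 0x10400 would need two WCHARs rather than one. 
Whether the plugin or the API functions cope with such pairs is a separate 
question I have not checked. A minimal sketch of the arithmetic, in Python, 
with utf16_units as a made-up helper name and valid (non-surrogate) input 
assumed:

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit UTF-16 code units for one Unicode code point."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_units(0x10400)
print([f"{u:04X}" for u in units])  # ['D801', 'DC00']
```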
 
> Finally, FWIW, even using only the "C" and "S" status types in 
> CaseFolding.txt (the non-exceptions), there are a few entries where the byte 
> size seems to vary between the upper and lower case variants of the same 
> letter expressed in utf-8:
> 
> unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
> unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
> unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
> unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
> unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
> unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
> unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
> unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
> unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
> unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
> unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
> unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
> unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
 
That's useful: worst case (pun, sorry) is that the converted string could be 
three times as long as the original (e.g. if every char is 006B and wants to 
become 212A), so if I pad out the buffer I use to accumulate replacement 
pattern results accordingly, it should be safe.
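
A quick way to double-check the factor of three, sketched in Python over the 
pairs quoted above. PAIRS, utf8_len, worst_growth and padded_size are names 
made up for this illustration, and the buffer is assumed to be measured in 
UTF-8 bytes:

```python
# (code point, case-folded code point) pairs from the list quoted above.
PAIRS = [
    (0x017F, 0x0073), (0x023A, 0x2C65), (0x023E, 0x2C66), (0x1E9E, 0x00DF),
    (0x1FBE, 0x03B9), (0x2126, 0x03C9), (0x212A, 0x006B), (0x212B, 0x00E5),
    (0x2C62, 0x026B), (0x2C64, 0x027D), (0x2C6D, 0x0251), (0x2C6E, 0x0271),
    (0x2C6F, 0x0250),
]


def utf8_len(code_point: int) -> int:
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def worst_growth(pairs) -> float:
    """Largest length ratio between the two forms, in either direction."""
    ratios = []
    for a, b in pairs:
        la, lb = utf8_len(a), utf8_len(b)
        ratios.append(max(la, lb) / min(la, lb))
    return max(ratios)


def padded_size(original_bytes: int, factor: int = 3) -> int:
    """Buffer size that covers the worst case found above."""
    return original_bytes * factor


print(worst_growth(PAIRS))  # 3.0, from the 212A / 006B pair
print(padded_size(10))      # 30
```

This only covers the pairs in the list; any "F" (full folding) entries that 
expand one character into several would need their own check.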


