[power-pro] Re: Unicode bugs? now regex issues

entropyreduction Tue, 18 Aug 2009 08:58:31 -0700

--- In [email protected], "Sheri" <sheri...@...> wrote:
>
> I did some evaluation of the CaseFolding.txt file.
> 
> First, the unicode.dll text case routines, must not be handling the 
> exceptions.
> 
> If it were, this:
> 
> local ustring=unicode.new("Maße")
> unicode.messagebox("OK", ustring.change_case("upper"), "upper case")
> 
> would display "MASSE"
> 
> Instead we get "MAßE"
> 
> I think this is preferable anyway for the regex bit.


I use the API functions CharUpperW and CharLowerW: whatever they spew out is what you 
get.  If you can find any other simple-ish way to case-convert unicode, let me know.
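
To illustrate the difference Sheri is seeing, here is a rough Python sketch (not the plugin's code, and not how CharUpperW is implemented). It compares a full case mapping, where one character may expand to several, with a one-to-one mapping that leaves a character alone when its uppercase form is more than one character. The helper name simple_upper is made up for this example.

```python
def simple_upper(text):
    """One-to-one uppercase: keep a character as-is if its
    uppercase form would expand to more than one character."""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)

# Python's str.upper() applies the full mapping, so the sharp s expands to "SS".
full = "Maße".upper()

# The one-to-one mapping keeps the sharp s, matching the result reported above.
simple = simple_upper("Maße")

print(full)
print(simple)
```

A one-to-one mapping keeps the character count unchanged, which is probably why it is the more convenient behaviour for the regex work.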

> Second, the unicode plugin is not handling all the possible numeric values 
> given in the unicode tables. It apparently should go much higher than 65280.
 
> For example:
> local test=unicode.from_num(0x10400)
> ;;error, but should be DESERET CAPITAL LETTER LONG I
> unicode.messagebox("ok", test)

Unicode characters in Windows, AFAIK, are defined as something that fits in a 
wide character. 

WCHAR == unsigned short int == 2 bytes == max 65,535

Goodness knows what microsnot api functions make of 0x10400.  No way to fit it 
in a single WCHAR.
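
For what it's worth, UTF-16 represents code points above 0xFFFF as a pair of 16-bit units (a surrogate pair), so 0x10400 would take two WCHARs rather than one. A minimal Python sketch of the arithmetic follows. The function name to_utf16_units is hypothetical, and whether the plugin or any particular API call copes with such pairs is not something this shows.

```python
def to_utf16_units(code_point):
    """Return the 16-bit UTF-16 code units for one code point.
    Sketch only: does not reject values in the surrogate range."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]

units = to_utf16_units(0x10400)
print([hex(u) for u in units])
```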
 
> Finally, FWIW, even using only the "C" and "S" status types in 
> CaseFolding.txt (the non-exceptions), there are a few entries where the byte 
> size seems to vary between the upper and lower case variants of the same 
> letter expressed in utf-8:
> 
> unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
> unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
> unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
> unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
> unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
> unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
> unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
> unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
> unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
> unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
> unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
> unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
> unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
 
That's useful: worst case (pun, sorry) is that the converted string could be three 
times as long as the original (e.g. if all chars are 006B and want to become 212A), 
so if I pad out the buffer I use to accumulate replacement pattern results 
accordingly, I should be safe.
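
As a sanity check on that factor of three, here is a small Python sketch that takes the code point pairs from the list above, measures each side's UTF-8 length, and reports the largest growth in either direction. The names PAIRS, utf8_len, worst_ratio and padded_size are invented for this example, and it only covers the pairs listed in this message, not the whole of CaseFolding.txt.

```python
# Code point pairs copied from the list above.
PAIRS = [
    (0x017F, 0x0073), (0x023A, 0x2C65), (0x023E, 0x2C66),
    (0x1E9E, 0x00DF), (0x1FBE, 0x03B9), (0x2126, 0x03C9),
    (0x212A, 0x006B), (0x212B, 0x00E5), (0x2C62, 0x026B),
    (0x2C64, 0x027D), (0x2C6D, 0x0251), (0x2C6E, 0x0271),
    (0x2C6F, 0x0250),
]

def utf8_len(code_point):
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))

# Largest growth factor when converting in either direction.
worst_ratio = max(
    max(utf8_len(a), utf8_len(b)) / min(utf8_len(a), utf8_len(b))
    for a, b in PAIRS
)

def padded_size(input_bytes, factor=3):
    """Buffer size (in bytes) that covers the worst case found above."""
    return input_bytes * factor

print(worst_ratio)
print(padded_size(100))
```

The worst pair is 006B versus 212A (1 byte versus 3), which is where the factor comes from.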
