[power-pro] Re: Unicode bugs? now regex issues

Sheri Tue, 18 Aug 2009 12:26:03 -0700

--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> --- In [email protected], "Sheri" <sherip99@> wrote:
> >
> > I did some evaluation of the CaseFolding.txt file.
> > 
> > First, the unicode.dll text case routines must not be handling the 
> > exceptions.
> > 
> > If they were, this:
> > 
> > local ustring=unicode.new("Maße")
> > unicode.messagebox("OK", ustring.change_case("upper"), "upper case")
> > 
> > would display "MASSE"
> > 
> > Instead we get "MAßE"
> > 
> > I think this is preferable anyway for the regex bit.
> 
> I use API functions CharUpperW, CharLowerW: whatever they spew
> out is what you get. If you can find any other simple-ish way to
> case convert unicode, let me know.


However it is doing it is ok with me.

> 
> > Second, the unicode plugin is not handling all the possible numeric values 
> > given in the unicode tables. It apparently should go much higher than 65280.
>  
> > For example:
> > local test=unicode.from_num(0x10400)
> > ;;error, but should be DESERET CAPITAL LETTER LONG I
> > unicode.messagebox("ok", test)
> 
> Unicode characters in windows, AFAIK, are defined as something that fits in a 
> wide character. 
> 
> WCHAR == unsigned short int == 2 bytes == max 65,535

You should change the doc for unicode.dll (it says 65280).

> 
> Goodness knows what microsnot api functions make of 0x10400. No
> way to fit it in a WCHAR.

LOL.

I learned that Unicode characters up to 0xFFFF are called "BMP characters". BMP 
is short for "Basic Multilingual Plane". But looking at UnicodeData.txt, there 
are more than 4600 additional characters defined beyond that in Unicode 5.1, 
plus a couple of reserved ranges. Only about 80 of them have a concept of text 
case.

I confess that I don't really understand it, but Windows supposedly uses 
"surrogate pairs" for (some?) non-BMP data.

<http://en.wikipedia.org/wiki/UTF-16/UCS-2>
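
As far as I can tell from that page, a surrogate pair is just the code point 
minus 0x10000, split into two 10-bit halves that each fit in a 16-bit unit. 
Here is a rough Python sketch of the arithmetic (the function name is my own 
invention, and it says nothing about what the Windows API or unicode.dll 
actually does with such values):

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


# DESERET CAPITAL LETTER LONG I, the character from the failing example
high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))
```

So 0x10400 would need two WCHARs rather than one.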

>  
> > Finally, FWIW, even using only the "C" and "S" status types in
> > CaseFolding.txt (the non-exceptions), there are a few entries
> > where the byte size seems to vary between the upper and lower
> > case variants of the same letter expressed in utf-8:
> > 
> > unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
> > unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
> > unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
> > unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
> > unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
> > unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
> > unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
> > unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
> > unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
> > unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
> > unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
> > unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
> > unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
>  
> That's useful: worst case (pun sorry) is that the converted string
> could be three times as long as the original (e.g. if all chars are
> 006B and want to become 212A), so if I pad out the buffer I use to
> accumulate replacement pattern results accordingly, it should be
> safe. 

That one is not reversible. Upcasing a lower case "k" gives "K", not 
K (U+212A, Capital Kelvin). So double, at worst.
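
A quick way to check this sort of thing outside PowerPro is Python, with the 
caveat that Python uses its own Unicode tables and case mapping rules, so it is 
only a stand-in for whatever CharUpperW/CharLowerW return. The helper name 
below is just something I made up for illustration:

```python
def utf8_len(text):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# The Kelvin sign (3 bytes) lowercases to plain "k" (1 byte), but "k"
# upcases to ASCII "K", so a string never grows 1 -> 3 bytes on the way up.
kelvin = "\u212a"
print(utf8_len(kelvin), kelvin.lower() == "k", "k".upper() == "K")

# A pair from the table that maps both ways: U+023A is 2 bytes in UTF-8
# and its lower case partner U+2C65 is 3 bytes.
upper_a, lower_a = "\u023a", "\u2c65"
growth = utf8_len(lower_a) / utf8_len(upper_a)
print(utf8_len(upper_a), utf8_len(lower_a), growth)
```

For that pair the growth factor is 1.5, which stays inside the "double" 
allowance.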

Another thing I learned is that if you convert the bytes of utf8 characters to 
binary, you can determine for each byte from its bits whether the byte is a 
single-byte ascii character, or a leading byte or continuation byte of a 
multibyte character, and, if it is a leading byte, how many bytes are in the 
character.

<http://en.wikipedia.org/wiki/UTF-8>
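
To make the bit patterns concrete, here is a small Python sketch of that 
classification. The function name and the labels it returns are my own, and it 
only inspects the high bits of a single byte; it does not check that a whole 
sequence is valid UTF-8.

```python
def classify_utf8_byte(byte):
    """Return (kind, length) for one UTF-8 byte, judged from its high bits.

    length is the total number of bytes in the character for an ascii or
    leading byte, and 0 for a continuation or invalid byte.
    """
    if (byte & 0b10000000) == 0b00000000:
        return ("ascii", 1)            # 0xxxxxxx
    if (byte & 0b11000000) == 0b10000000:
        return ("continuation", 0)     # 10xxxxxx
    if (byte & 0b11100000) == 0b11000000:
        return ("lead", 2)             # 110xxxxx
    if (byte & 0b11110000) == 0b11100000:
        return ("lead", 3)             # 1110xxxx
    if (byte & 0b11111000) == 0b11110000:
        return ("lead", 4)             # 11110xxx
    return ("invalid", 0)              # 11111xxx never appears in UTF-8


# Show each byte of the earlier test string in binary with its classification
for b in "Maße".encode("utf-8"):
    print(f"{b:08b}", classify_utf8_byte(b))
```

The "ß" shows up as a leading byte announcing two bytes, followed by one 
continuation byte; the other letters are single ascii bytes.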

Regards,
Sheri
