[power-pro] Re: Unicode bugs? now regex issues

entropyreduction Tue, 18 Aug 2009 21:54:00 -0700

--- In [email protected], "Sheri" <sheri...@...> wrote:
>
> > > First, the unicode.dll text case routines must not be handling the
> > > exceptions.


> > I use API functions CharUpperW, CharLowerW: whatever they spew
> > out is what you get. If you can find any other simple-ish way to
> > case convert unicode, let me know.
> 
> However it is doing it is ok with me.

I'll update docs to mention exceptions
 

> > > Second, the unicode plugin is not handling all the possible numeric 
> > > values given in the unicode tables. It apparently should go much higher 
> > > than 65280.
> >  
> > > For example:
> > > local test=unicode.from_num(0x10400)
> > > ;;error, but should be DESERET CAPITAL LETTER LONG I
> > > unicode.messagebox("ok", test)
> > 
> > Unicode characters in windows, AFAIK, are defined as something that fits in 
> > a wide character. 
> > 
> > WCHAR == unsigned short int == 2 bytes == max 65,535
> 
> You should change the doc for unicode.dll (it says 65280)

will do
 
> I learned that Unicode characters up to 0xFFFF are called "bmp characters". 
> Bmp is short for "basic multilingual plane". But looking at UnicodeData.txt, 
> there are more than 4600 additional characters defined beyond that in Unicode 
> 5.1, plus a couple of reserved ranges. Only about 80 of them have a concept 
> of text case.
> 
> I confess that I don't really understand it, but Windows supposedly uses 
> "surrogate pairs" for (some?) non-bmp data.
> 
> <http://en.wikipedia.org/wiki/UTF-16/UCS-2>

Nice article.  

http://unicode.org/iuc/iuc18/papers/a8.ppt

suggests a fairly straightforward way to convert non-bmp characters.
Who knew?
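
For reference, a rough sketch in C of the standard UTF-16 surrogate-pair arithmetic, which may or may not match what the slides describe. The function name utf16_encode is made up for illustration and is not part of unicode.dll. It shows why 0x10400 cannot fit in one WCHAR but can be stored as two.

```c
#include <stdint.h>

/* Hypothetical helper, not part of unicode.dll.
   Encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units into out[] and returns that count,
   or 0 if cp is not a valid code point to encode. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* bmp character: one code unit */
        return 1;
    }

    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

With this arithmetic, 0x10400 (DESERET CAPITAL LETTER LONG I) comes out as the pair 0xD801 0xDC00, so a from_num-style routine would have to return a two-WCHAR string for such input.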


...snip....

> > That's useful: worst case (pun sorry) is converted string could
> > be three times as long as original (e.g. if all chars 006B and
> > want to become 212A), so if I pad out the buffer I use to
> > accumulate replacement pattern results accordingly, should be
> > safe. 
> 
> That one is not reversible. If upcasing a lower case "k" it will be "K", not 
> K (Capital Kelvin, U+212A). So double, at worst.

Okay, thanks
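
The length arithmetic behind the padding estimate can be sketched as below, assuming the buffer holds UTF-8 (the thread does not say so explicitly, but the "three times" figure matches 006B being 1 byte and 212A being 3). The name utf8_len is hypothetical, not something in the plugin.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for code point cp.
   Returns 0 for values that cannot be encoded (surrogates, > 0x10FFFF). */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)                    return 1;   /* ascii */
    if (cp < 0x800)                   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate range */
    if (cp < 0x10000)                 return 3;   /* rest of the bmp */
    if (cp <= 0x10FFFF)               return 4;   /* non-bmp */
    return 0;
}
```

Comparing utf8_len of a character before and after case conversion gives the growth factor for that character: 0x006B to 0x212A would be 1 to 3 bytes, while 0x006B to 0x004B stays at 1.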
 
> Another thing I learned is that if you convert the bytes of utf8 characters 
> to binary, you can determine for each byte from its bits whether the byte is 
> a single-byte ascii character, or a leading byte or continuation byte of a 
> multibyte character. If it is a leading byte, how many bytes are in the 
> character.
 
> <http://en.wikipedia.org/wiki/UTF-8>
 
Could be useful.  Ta.
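
A minimal sketch in C of that byte classification, going by the bit patterns in the UTF-8 definition. The function name and return convention are my own assumptions for illustration only.

```c
/* Hypothetical helper: classify one byte of a UTF-8 stream by its high bits.
   Returns 1..4 for a leading byte (the value is the total number of bytes
   in the character, 1 meaning plain ascii), 0 for a continuation byte,
   and -1 for a byte that never occurs in well-formed UTF-8. */
int utf8_byte_kind(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;                    /* 0xxxxxxx: ascii */
    if ((b & 0xC0) == 0x80) return 0;                    /* 10xxxxxx: continuation */
    if ((b & 0xE0) == 0xC0) return (b >= 0xC2) ? 2 : -1; /* 110xxxxx; C0 and C1 would be overlong */
    if ((b & 0xF0) == 0xE0) return 3;                    /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return (b <= 0xF4) ? 4 : -1; /* 11110xxx; F5..F7 exceed 0x10FFFF */
    return -1;                                           /* F8..FF */
}
```

For the Kelvin sign, whose UTF-8 bytes are E2 84 AA, this reports 3 for the first byte and 0 for the other two, which is enough to step through a string character by character or to resynchronise from the middle of one.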
