[power-pro] Re: Unicode: multibyte

entropyreduction Mon, 24 Aug 2009 20:05:07 -0700

--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
>
> --- In [email protected], "entropyreduction" 
> <alancampbelllists+yahoo@> wrote:
> 
> You could include (in the docs) a script that can identify (using 
> regex) whether there are any high code points in a unicode string.
> 
> if (regex.pcrematch(?"[^\x{0001}-\x{FFFF}]", h_ustring, "utf8")==0) do
>   ;unicode.services are ok
> else
>   ;avoid character based unicode.services
> endif


Sorry, don't understand. h_ustring is converted to UTF-8, so it's made up of a 
string of 8-bit bytes, all of which by definition have to be in the range 
\x{0001}-\x{FFFF}, surely?  I'd need to recognise the UTF-8 equivalents of 

Combining Diacritical Marks (0300-036F)
Combining Diacritical Marks Supplement (1DC0-1DFF)
Combining Diacritical Marks for Symbols (20D0-20FF)
Combining Half Marks (FE20-FE2F)
http://en.wikipedia.org/wiki/Combining_character#Unicode_ranges

first surrogate: 0xD800-0xDBFF
second surrogate: 0xDC00-0xDFFF 
http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP

which sounds horrific: probably easier for me to check for those ranges in a 
UTF-16 string.
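
For illustration only, here is a rough sketch of that UTF-16 range check. It is 
written in Python rather than PowerPro script, and the names utf16_units and 
needs_care are made up for this example; they are not part of the unicode 
plugin. It assumes the string is available as little-endian UTF-16 and simply 
tests each 16-bit code unit against the ranges listed above.

```python
import struct

# Ranges quoted from the lists above (inclusive, as UTF-16 code units).
COMBINING_RANGES = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
]
SURROGATE_RANGE = (0xD800, 0xDFFF)  # first and second surrogates together


def utf16_units(text):
    """Return the 16-bit code units of text encoded as little-endian UTF-16."""
    raw = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)


def needs_care(text):
    """True if any code unit is a surrogate or one of the listed combining marks."""
    for unit in utf16_units(text):
        if SURROGATE_RANGE[0] <= unit <= SURROGATE_RANGE[1]:
            return True
        if any(lo <= unit <= hi for lo, hi in COMBINING_RANGES):
            return True
    return False
```

A string for which needs_care returns False has one 16-bit unit per character 
and none of the listed combining marks, so character-based services should be 
safe on it; anything else would want the more cautious path.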
