--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
>
> --- In [email protected], "entropyreduction"
> <alancampbelllists+yahoo@> wrote:
>
> You could include (in the docs) a script that can identify (using
> regex) whether there are any high code points in a unicode string.
>
> if (regex.pcrematch(?"[^\x{0001}-\x{FFFF}]", h_ustring, "utf8")==0) do
> ;unicode.services are ok
> else
> ;avoid character based unicode.services
> endif
Sorry, I don't understand. h_ustring is converted to UTF-8, so it's made up of a
string of 8-bit bytes, all of which by definition have to be in the range
\x{0001}-\x{FFFF}, surely? I'd need to recognise the UTF-8 equivalents of
Combining Diacritical Marks (0300-036F)
Combining Diacritical Marks Supplement (1DC0-1DFF)
Combining Diacritical Marks for Symbols (20D0-20FF)
Combining Half Marks (FE20-FE2F)
http://en.wikipedia.org/wiki/Combining_character#Unicode_ranges
first surrogate: 0xD800-0xDBFF
second surrogate: 0xDC00-0xDFFF
http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP
which sounds horrific: probably easier for me to check for those ranges in a
UTF-16 string.
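
For what it's worth, here is a rough sketch of that UTF-16 check, written in Python only to illustrate the idea (it is not plugin code, and the names utf16_units and needs_care are made up for this example). It assumes the string is available as little-endian UTF-16 and simply tests each 16-bit code unit against the ranges listed above:

```python
import struct

# Ranges listed above (inclusive), expressed as UTF-16 code units.
COMBINING_RANGES = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
]
# First (0xD800-0xDBFF) and second (0xDC00-0xDFFF) surrogates are contiguous.
SURROGATE_RANGE = (0xD800, 0xDFFF)


def utf16_units(text):
    """Return the UTF-16 code units of a string as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def needs_care(units):
    """True if any code unit is a surrogate or falls in one of the combining ranges."""
    for unit in units:
        if SURROGATE_RANGE[0] <= unit <= SURROGATE_RANGE[1]:
            return True
        if any(lo <= unit <= hi for lo, hi in COMBINING_RANGES):
            return True
    return False
```

If needs_care(utf16_units(s)) comes back true, the character-based services would be the ones to avoid for that string. Because every test is on a single 16-bit unit, there is no need to decode multi-byte UTF-8 sequences first, which is the part that sounded horrific.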