[power-pro] Re: Unicode: multibyte

entropyreduction Mon, 24 Aug 2009 09:09:42 -0700

--- In [email protected], "silvermoonwoman2001" <sheri...@...> wrote:
>
> --- In [email protected], "entropyreduction" 
> <alancampbelllists+yahoo@> wrote:
> 
> It seems to me that surrogate pairs are a rarity and a novelty, and special
> support for them in the unicode plugin isn't necessary. The plugin shouldn't
> fail to read them if they turn up in a file, nor fail to accept them if
> present, e.g., in a string from_utf8. Just mention in the docs that character
> counts generated by unicode.services will be overstated in this rare
> circumstance, and that arbitrary slicing should then be avoided. You could
> include (in the docs) a script that can identify (using regex) whether there
> are any high code points in a unicode string.


I'll certainly do that in the next release of the plugin.

I'll also need to mark every service that may not work correctly on such strings.

> if (regex.pcrematch(?"[^\x{0001}-\x{FFFF}]", h_ustring, "utf8")==0) do
>   ;unicode.services are ok
> else
>   ;avoid character based unicode.services
> endif

I can also provide a service that reports whether a string is plain
UCS-2 (no surrogate pairs, no combining character sequences). I already know
how to detect the former; I'll have to look up the latter.
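
A rough sketch of what such a check might look like, written in Python rather than PowerPro script and not taken from the plugin. The function name is_plain_ucs2 is made up for illustration, and treating any character in a Unicode "Mark" category (Mn, Mc, Me) as the sign of a combining sequence is my assumption about how the test could be done:

```python
import unicodedata

def is_plain_ucs2(text):
    """Return True if text needs no surrogate pairs in UTF-16
    and contains no combining marks (hypothetical helper)."""
    for ch in text:
        # Code points above U+FFFF require a surrogate pair in UTF-16.
        if ord(ch) > 0xFFFF:
            return False
        # Assumption: any Mark category (Mn, Mc, Me) signals a
        # combining character sequence.
        if unicodedata.category(ch).startswith("M"):
            return False
    return True
```

A string for which this returns True has exactly one 16-bit unit per user-visible character, so unit-based counts and slices should be safe for it.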

Given all that, I suppose I could allow values above 0xFFFF in the
create/append services that take numeric input (and map them to
surrogate pairs).
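
For reference, the mapping itself is the standard UTF-16 arithmetic. A minimal sketch in Python (not plugin code; the name to_utf16_units is hypothetical):

```python
def to_utf16_units(code_point):
    """Map a numeric code point to a list of one or two UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid code points")
    if code_point <= 0xFFFF:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000      # 20-bit value to split
    high = 0xD800 + (offset >> 10)     # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits: low surrogate
    return [high, low]
```

For example, to_utf16_units(0x1F600) gives [0xD83D, 0xDE00], while anything at or below 0xFFFF comes back as a single unit.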
