--- In [email protected], "entropyreduction"
<alancampbelllists+ya...@...> wrote:
> Anyway, if W2K gets some things wrong in other parts of the api
> besides counting characters, all the more reason to switch to a
> third party lib that's kept up to date. Eventually.
It seems to me that surrogate pairs are a rarity and a novelty, and special
support for them in the unicode plugin isn't necessary. The plugin shouldn't
fail to read them if they turn up in a file, nor fail to accept them if
present, e.g., in a string from_utf8. Just mention in the docs that character
counts generated by unicode.services will be overstated in this rare
circumstance (each surrogate pair gets counted as two characters), and that
arbitrary slicing should then be avoided, since a slice could split a pair.
You could include (in the docs) a script that uses a regex to check whether
a unicode string contains any code points above U+FFFF:
if (regex.pcrematch(?"[^\x{0001}-\x{FFFF}]", h_ustring, "utf8")==0) do
    ;unicode.services are ok
else
    ;avoid character based unicode.services
endif
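For anyone who wants to see the overcount outside PowerPro, here is a rough Python sketch of the same idea. It is only an illustration, not plugin code: the names has_non_bmp and utf16_units are made up for this example, and it assumes the plugin's counts are really counts of UTF-16 code units.

```python
import re

# Any code point above U+FFFF lies outside the Basic Multilingual Plane
# and needs a surrogate pair (two 16-bit units) in UTF-16.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def has_non_bmp(text: str) -> bool:
    """Return True if text contains at least one code point above U+FFFF."""
    return NON_BMP.search(text) is not None


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, which is what a 16-bit 'length' reports."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    sample = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
    print(has_non_bmp(sample))  # the string contains a non-BMP code point
    print(len(sample))          # code points in the string
    print(utf16_units(sample))  # UTF-16 units; one more per surrogate pair
```

When has_non_bmp returns False the two counts agree, which is the case where character based services are safe to use.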
Regards,
Sheri