On 2011-12-31 16:47:40 +0000, Sean Kelly <[email protected]> said:

I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science.

It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization form you pick. But most people don't know that.
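To make the three layers concrete, here's a small Python sketch (Python just for illustration; the same distinctions apply to D strings) showing one user-perceived character, "é", at the code point and UTF-8 code unit levels:

```python
# Two canonically equivalent representations of the character "é":
s1 = "\u00e9"    # precomposed: LATIN SMALL LETTER E WITH ACUTE
s2 = "e\u0301"   # decomposed: "e" + COMBINING ACUTE ACCENT

print(len(s1))                   # 1 code point
print(len(s2))                   # 2 code points
print(len(s1.encode("utf-8")))   # 2 UTF-8 code units (bytes)
print(len(s2.encode("utf-8")))   # 3 UTF-8 code units (bytes)
print(s1 == s2)                  # False, though both display identically
```

One grapheme cluster, one or two code points, two or three code units, all depending on representation.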

If you want to count the number of *characters*, counting code points isn't really it: you should skip the combining ones. If you want to search for a substring, you need to make sure both strings use the same normalization form first, and if not, normalize them appropriately so that equivalent code point sequences are always represented the same way.
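Both points can be sketched in a few lines of Python (the function names are mine, and skipping combining marks is only an approximation of grapheme counting, but it shows the idea):

```python
import unicodedata

def count_characters(s):
    """Approximate user-perceived character count by skipping combining marks."""
    return sum(1 for ch in s if not unicodedata.combining(ch))

def contains(haystack, needle, form="NFC"):
    """Substring search that first normalizes both strings to the same form,
    so precomposed and decomposed spellings of the same text still match."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

# "élève" written with a decomposed first "é": 6 code points, 5 characters.
print(count_characters("e\u0301l\u00e8ve"))   # 5

# Decomposed needle found in a precomposed haystack:
print(contains("caf\u00e9", "e\u0301"))        # True
```

A naive `needle in haystack` on the raw strings would miss that last match, because the code point sequences differ even though the text is equivalent.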

That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.
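The "without decoding code points" part works because of how UTF-8 is designed: bytes below 0x80 never occur inside a multi-byte sequence, so a parser can scan for ASCII delimiters byte-by-byte. A Python sketch (the function name is mine):

```python
def find_quote(utf8_bytes):
    """Find the first '"' delimiter in UTF-8 encoded bytes.

    Safe without decoding: in UTF-8 every byte of a multi-byte sequence
    is >= 0x80, so an ASCII byte like 0x22 ('"') can only ever be a real
    quote character, never a fragment of some other code point.
    """
    return utf8_bytes.find(b'"')

# The multi-byte "é"s before the quote can't produce a false hit:
print(find_quote('été "x"'.encode("utf-8")))   # 6
```

This is exactly the property that lets an XML or JSON tokenizer treat a UTF-8 buffer as bytes for delimiter scanning and only decode code points where the spec actually demands it.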

How to pack all this into an easy-to-use package is the most challenging part.


--
Michel Fortin
[email protected]
http://michelf.com/
