Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) and encoding-aware ones.
Sent from my iPhone

On Dec 31, 2011, at 12:48 PM, Michel Fortin <[email protected]> wrote:

> On 2011-12-31 16:47:40 +0000, Sean Kelly <[email protected]> said:
>
>> I don't know that Unicode expertise is really required here anyway. All one
>> has to know is that UTF8 is a multibyte encoding and built-in string
>> attributes talk in bytes. Knowing when one wants bytes vs characters isn't
>> rocket science.
>
> It's not bytes vs. characters, it's code units vs. code points vs.
> user-perceived characters (grapheme clusters). One character can span
> multiple code points, and can be represented in various ways depending on
> which Unicode normalization you pick. But most people don't know that.
>
> If you want to count the number of *characters*, counting code points isn't
> really it, as you should avoid counting the combining ones. If you want to
> search for a substring, you need to be sure both strings use the same
> normalization first, and if not, normalize them appropriately so that
> equivalent code point combinations are always represented the same.
>
> That said, if you are implementing an XML or JSON parser, since those specs
> are defined in terms of code points you should probably write your code in
> terms of code points (hopefully without decoding code points when you don't
> need to). On the other hand, if you're writing something that processes text
> (like counting the average number of *characters* per word in a document),
> then you should be aware of combining characters.
>
> How to pack all this into an easy-to-use package is most challenging.
>
> --
> Michel Fortin
> [email protected]
> http://michelf.com/
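
To make the code units / code points / graphemes distinction concrete, here's a minimal D sketch. It assumes a Phobos recent enough to have std.utf.count and std.uni.byGrapheme; those names come from today's standard library, not from anything discussed in this thread:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : count;

    void main()
    {
        // "noël" spelled with a combining diaeresis (e + U+0308)
        string s = "noe\u0308l";
        writeln(s.length);                // 6 -- UTF-8 code units (bytes)
        writeln(s.count);                 // 5 -- code points
        writeln(s.byGrapheme.walkLength); // 4 -- grapheme clusters
    }

Same string, three different "lengths", depending on which level you ask about.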

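And for the substring-search point above, a sketch of normalizing both sides before comparing (again assuming std.uni.normalize from current Phobos; canonicalContains is just a made-up helper name for illustration):

    import std.algorithm.searching : canFind;
    import std.uni : NFC, normalize;

    // Hypothetical helper: substring search that treats canonically
    // equivalent code point sequences as equal, by normalizing both
    // strings to NFC before comparing.
    bool canonicalContains(string haystack, string needle)
    {
        return haystack.normalize!NFC.canFind(needle.normalize!NFC);
    }

    unittest
    {
        // Precomposed "é" (U+00E9) vs. "e" + combining acute (U+0301)
        assert(canonicalContains("re\u0301sume\u0301", "\u00E9"));
    }

Without the normalize step the assert fails: the two spellings of "é" are different code point sequences, so a byte-level search never matches.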