Sorry, I was simplifying. The distinction I was trying to make was between
generic operations (in my experience, the majority) and encoding-aware ones.
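For example (a rough Python sketch, just to illustrate the split): copying or
concatenating can treat a UTF-8 string as opaque bytes, while slicing at
character boundaries cannot.

    # Generic: concatenation and copying never inspect the bytes.
    data = "café".encode("utf-8")       # 5 bytes: 'é' is 0xC3 0xA9
    copied = b"[" + data + b"]"

    # Encoding-aware: slicing by character requires decoding first.
    print(data.decode("utf-8")[:3])     # 'caf'
    try:
        data[:4].decode("utf-8")        # byte slice splits 'é' in half
    except UnicodeDecodeError:
        print("sliced mid code point")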

Sent from my iPhone

On Dec 31, 2011, at 12:48 PM, Michel Fortin <[email protected]> wrote:

> On 2011-12-31 16:47:40 +0000, Sean Kelly <[email protected]> said:
> 
>> I don't know that Unicode expertise is really required here anyway. All one
>> has to know is that UTF8 is a multibyte encoding and built-in string
>> attributes talk in bytes. Knowing when one wants bytes vs. characters isn't
>> rocket science.
> 
> It's not bytes vs. characters, it's code units vs. code points vs.
> user-perceived characters (grapheme clusters). One character can span
> multiple code points, and can be represented in various ways depending on
> which Unicode normalization form you pick. But most people don't know that.
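To make that concrete, a quick Python sketch (both spellings below denote the
same user-perceived character, 'é'):

    import unicodedata

    one = "\u00e9"     # 'é' as a single precomposed code point (NFC)
    two = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT (NFD): same character

    print(len(one), len(two))            # 1 vs. 2 code points
    print(len(one.encode("utf-8")),
          len(two.encode("utf-8")))      # 2 vs. 3 UTF-8 code units
    print(one == two)                    # False: raw comparison differs
    print(unicodedata.normalize("NFC", two) == one)   # True once normalized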
> 
> If you want to count the number of *characters*, counting code points isn't
> really it, as you should avoid counting the combining ones. If you want to
> search for a substring, you need to be sure both strings use the same
> normalization first, and if not, normalize them appropriately so that
> equivalent code point combinations are always represented the same way.
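In Python terms, something like this (skipping combining marks approximates
character counting; real grapheme segmentation is UAX #29 and is more
involved):

    import unicodedata

    def count_characters(s):
        # Approximate: don't count combining marks as characters.
        return sum(1 for ch in s if not unicodedata.combining(ch))

    def contains(haystack, needle):
        # Normalize both sides so equivalent code point sequences match.
        return (unicodedata.normalize("NFC", needle)
                in unicodedata.normalize("NFC", haystack))

    print(count_characters("e\u0301"))       # 1, not 2
    print("e\u0301" in "caf\u00e9")          # False: different code points
    print(contains("caf\u00e9", "e\u0301"))  # True after normalization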
> 
> That said, if you are implementing an XML or JSON parser, since those specs
> are defined in terms of code points, you should probably write your code in
> terms of code points (hopefully without decoding code points when you don't
> need to). On the other hand, if you're writing something that processes text
> (like counting the average number of *characters* per word in a document),
> then you should be aware of combining characters.
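On the parser side, UTF-8's design already helps with the "without decoding"
part: bytes below 0x80 never occur inside a multibyte sequence, so an ASCII
delimiter can be located on the raw bytes. A sketch (the helper name is just
for illustration):

    def find_ascii_delimiter(buf: bytes, delim: bytes) -> int:
        # Safe on raw UTF-8: continuation bytes are 0x80-0xBF, so an
        # ASCII byte like '"' never appears inside a multibyte character.
        assert len(delim) == 1 and delim < b"\x80"
        return buf.find(delim)

    data = '{"name": "café"}'.encode("utf-8")
    print(find_ascii_delimiter(data, b'"'))  # 1, no code points decoded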
> 
> How to pack all this into an easy-to-use package is the most challenging part.
> 
> 
> -- 
> Michel Fortin
> [email protected]
> http://michelf.com/
> 
