On 2011-12-31 16:47:40 +0000, Sean Kelly <[email protected]> said:

I don't know that Unicode expertise is really required here anyway. All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science.

It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization form you pick. But most people don't know that.
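To make the three layers concrete, here's a small Python sketch (Python just for illustration; the same distinctions apply to D strings) showing one user-perceived character, "é", at the code point and UTF-8 code unit levels:

```python
# Two canonically equivalent representations of the character "é":
s1 = "\u00e9"    # precomposed: LATIN SMALL LETTER E WITH ACUTE
s2 = "e\u0301"   # decomposed: "e" + COMBINING ACUTE ACCENT

print(len(s1))                   # 1 code point
print(len(s2))                   # 2 code points
print(len(s1.encode("utf-8")))   # 2 UTF-8 code units (bytes)
print(len(s2.encode("utf-8")))   # 3 UTF-8 code units (bytes)
print(s1 == s2)                  # False, though both display identically
```

One grapheme cluster, one or two code points, two or three code units, all depending on representation.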

If you want to count the number of *characters*, counting code points isn't really it: you should skip the combining ones. If you want to search for a substring, you need to make sure both strings use the same normalization form first, and if not, normalize them appropriately so that equivalent code point sequences are always represented the same way.
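Both points can be sketched in a few lines of Python (the function names are mine, and skipping combining marks is only an approximation of grapheme counting, but it shows the idea):

```python
import unicodedata

def count_characters(s):
    """Approximate user-perceived character count by skipping combining marks."""
    return sum(1 for ch in s if not unicodedata.combining(ch))

def contains(haystack, needle, form="NFC"):
    """Substring search that first normalizes both strings to the same form,
    so precomposed and decomposed spellings of the same text still match."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

# "élève" written with a decomposed first "é": 6 code points, 5 characters.
print(count_characters("e\u0301l\u00e8ve"))   # 5

# Decomposed needle found in a precomposed haystack:
print(contains("caf\u00e9", "e\u0301"))        # True
```

A naive `needle in haystack` on the raw strings would miss that last match, because the code point sequences differ even though the text is equivalent.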

That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.
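The "without decoding code points" part works because of how UTF-8 is designed: bytes below 0x80 never occur inside a multi-byte sequence, so a parser can scan for ASCII delimiters byte-by-byte. A Python sketch (the function name is mine):

```python
def find_quote(utf8_bytes):
    """Find the first '"' delimiter in UTF-8 encoded bytes.

    Safe without decoding: in UTF-8 every byte of a multi-byte sequence
    is >= 0x80, so an ASCII byte like 0x22 ('"') can only ever be a real
    quote character, never a fragment of some other code point.
    """
    return utf8_bytes.find(b'"')

# The multi-byte "é"s before the quote can't produce a false hit:
print(find_quote('été "x"'.encode("utf-8")))   # 6
```

This is exactly the property that lets an XML or JSON tokenizer treat a UTF-8 buffer as bytes for delimiter scanning and only decode code points where the spec actually demands it.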

How to pack all this into an easy-to-use package is the most challenging part.


--
Michel Fortin
[email protected]
http://michelf.com/
