Re: The length of strings vs. # of chars vs. sizeof

Jérôme M. Berger Sun, 01 Nov 2009 14:10:10 -0800

Rainer Deyke wrote:

Charles Hixson wrote:

I've read and re-read the documentation, but I can't decide whether a
UTF-8 character that takes multiple bytes to express counts as one or
multiple values in length and sizeof.  Sizeof seems to presume that all
entries are the same length, but otherwise it seems to be the property I
need.  (I suppose that I could just enter a string that I know is
multi-byte chars, but it sure would be better if I could find out from
the documentation.)  I'm pretty certain that it just counts as one
character for indexing, so length would almost need to also count the
number of characters rather than bytes.


Strings are just arrays of code units.  Their length is the number of
elements (i.e. code units) they contain, just like other arrays.  A code
point may comprise multiple code units, and a logical character may
comprise multiple code points.  The latter is true even with dchar/UTF-32.

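So, in UTF-8, length is the number of bytes in the string, and sizeof is 8
(on 32-bit systems), because sizeof measures the array reference itself (a
length and a pointer), not the data it refers to.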
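
For what it's worth, here is a minimal sketch of the points above, assuming a D2 compiler. The helper name countCodePoints is made up for illustration and is not something from the thread or the standard library. It relies on foreach decoding a string into dchar values.

```d
import std.stdio;

// Hypothetical helper (not from the thread): counts code points by letting
// foreach decode the UTF-8 code units into dchar values.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";          // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);            // length counts code units (bytes for char)
    assert(countCodePoints(s) == 5);  // but there are only five code points

    // Indexing also works on code units, not on characters.
    assert(s[1] == 0xC3 && s[2] == 0xA9);

    // A combining sequence is two code points even in UTF-32.
    dstring d = "e\u0301"d;           // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(d.length == 2);

    // .sizeof is the size of the array reference (length + pointer),
    // whatever the contents: 8 on 32-bit targets, 16 on 64-bit ones.
    static assert(string.sizeof == size_t.sizeof + (void*).sizeof);
    static assert(s.sizeof == string.sizeof);

    writeln("length = ", s.length, ", code points = ", countCodePoints(s),
            ", sizeof = ", s.sizeof);
}
```

So if what you need is the storage taken by the text itself, s.length (times the element's .sizeof for wstring or dstring) is probably the property to use, not s.sizeof.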


                Jerome
--
mailto:[email protected]
http://jeberger.free.fr
Jabber: [email protected]
