Re: Counting "characters".

Markus Kuhn Wed, 03 Apr 2002 05:00:06 -0800

Michael B. Allen wrote on 2002-04-03 08:21 UTC:
> > I'm assuming you don't have a specific application in mind, since you
> > didn't answer Markus's question.
> 
> Ok, here's an example. The Document Object Model W3C spec describes some
> 'CharacterData' methods:
> 
>   http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html#ID-FF21A306
> 
> My C implementation of this spec has functions for these methods like:
> 
>   DOM_String *DOM_CharacterData_substringData(DOM_CharacterData *data, int offset, int count);
>   void DOM_CharacterData_deleteData(DOM_CharacterData *data, int offset, int count);
> 
> These offset and count parameters are described like 'The number of
> characters to extract' or 'The character offset at which to insert'
> etc. The DOM API is one of these XML peripherals, and so it relies on
> the 'Char' type ultimately defined in the XML spec here:
> 
>   http://www.w3.org/TR/REC-xml#charsets
> 
> Which at one point has an actual "definition":
> 
>   [Definition: A character is an atomic unit of text as specified by
>   ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).


What they probably originally wanted it to mean is "an integer number
that can be easily converted into an address in a linear data structure
that represents coded text". That would really mean "byte" (for UTF-8)
or "16-bit word" (for UTF-16). Everything else would be inefficient, as
you would have to scan the text from its beginning each time before you
can locate the start of the addressed substring.
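To illustrate the cost, here is a minimal sketch of what a character
offset forces on a UTF-8 implementation. The function name
utf8_char_to_byte_offset is hypothetical, and the code assumes valid,
NUL-terminated UTF-8 input. Every call walks the string from its first
byte:

```c
#include <stddef.h>

/* Hypothetical helper, not part of any DOM library.
 * Returns the byte offset at which the character with index
 * char_offset starts, or (size_t)-1 if the string has fewer
 * characters than that. Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_char_to_byte_offset(const char *s, size_t char_offset)
{
    size_t i = 0;      /* current byte position */
    size_t chars = 0;  /* characters skipped so far */

    while (chars < char_offset) {
        if (s[i] == '\0')
            return (size_t)-1;  /* offset lies beyond the end */
        i++;                    /* step over the lead byte */
        /* step over continuation bytes (bit pattern 10xxxxxx) */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            i++;
        chars++;
    }
    return i;
}
```

With a byte offset the same lookup would be a plain pointer addition;
with a character offset it is linear in the offset.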

But they probably didn't realize this at the time of writing, and used
the original Java notion that the 16-bit char type contains UCS-2 data
and that there are no surrogates. Their definition of "character"
now de facto leads to inefficient access, because an implementation
must not count the second (low) surrogate of a pair in UTF-16 or the
continuation bytes in UTF-8 as characters.
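As a rough sketch of that counting rule (the function names are made up
for illustration, and well-formed input is assumed), a character count
has to skip exactly those code units:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; both assume well-formed input. */

/* Count characters in nbytes bytes of UTF-8 by counting every
 * byte that is NOT a continuation byte (10xxxxxx). */
size_t utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Count characters in nunits 16-bit units of UTF-16 by counting
 * every unit that is NOT a low surrogate (0xDC00..0xDFFF). */
size_t utf16_count_chars(const uint16_t *s, size_t nunits)
{
    size_t n = 0;
    for (size_t i = 0; i < nunits; i++)
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)
            n++;
    return n;
}
```

Either way the count is only known after looking at every code unit,
which is the inefficiency described above.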

Whenever you are dealing with a variable-length encoding of characters,
you really don't want to specify anything in terms of a number of
characters.

If in C you use DOM_CharacterData = wchar_t with UCS-4, then you easily
stay out of this, because every character occupies exactly one wchar_t
and the offset and count parameters are really just indices into
wchar_t arrays.
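A minimal sketch of that case, assuming a platform where wchar_t is
32 bits wide and holds UCS-4 (as with glibc; this does not hold where
wchar_t is 16 bits). The name wcs_substring is hypothetical and is not
the DOM_CharacterData_substringData function from the quoted code:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated copy of count
 * characters of data starting at offset. The count is clamped to
 * the end of the string. Returns NULL if offset is past the end
 * or if allocation fails. The caller frees the result. */
wchar_t *wcs_substring(const wchar_t *data, size_t offset, size_t count)
{
    size_t len = wcslen(data);
    wchar_t *out;

    if (offset > len)
        return NULL;
    if (count > len - offset)
        count = len - offset;

    out = malloc((count + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    /* offset is a plain array index: no decoding, no scanning */
    wmemcpy(out, data + offset, count);
    out[count] = L'\0';
    return out;
}
```

The wcslen() call is still linear here; an implementation that stores
the length alongside the data would make the whole lookup constant time
apart from the copy.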

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
