Re: Counting "characters".

Michael B. Allen Wed, 03 Apr 2002 11:22:44 -0800

On Wed, 03 Apr 2002 14:18:53 +0100
Markus Kuhn <[EMAIL PROTECTED]> wrote:


> Michael B. Allen wrote on 2002-04-03 08:21 UTC:
> > > I'm assuming you don't have a specific application in mind, since you
> > > didn't answer Markus's question.
> > 
> > Ok, here's an example. The Document Object Model W3C spec describes some
> > 'CharacterData' methods:
> > 
> >   http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html#ID-FF21A306
> > 
> > My C implementation of this spec has functions for these methods like:
> > 
> >   DOM_String *DOM_CharacterData_substringData(DOM_CharacterData *data, int offset, int count);
> >   void DOM_CharacterData_deleteData(DOM_CharacterData *data, int offset, int count);
> > 
> > These offset and count parameters are described like 'The number of
> > characters to extract' or 'The character offset at which to insert'
> > etc. The DOM API is one of these XML peripherals, so its notion of a
> > character is ultimately the 'Char' type defined in the XML spec here:
> > 
> >   http://www.w3.org/TR/REC-xml#charsets
> > 
> > Which at one point has an actual "definition":
> > 
> >   [Definition: A character is an atomic unit of text as specified by
> >   ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).]
> 
> What they probably originally wanted it to mean is "an integer number
> that can be easily converted into an address in a linear data structure
> that represents coded text". That would really mean "byte" (for UTF-8)
> or "16-bit word" (for UTF-16). Everything else would be inefficient, as
> you would have to parse the entire text each time before you can access
> the start of the addressed substring.
> 
> But they probably didn't realize this at the time of writing, and used
> the original Java notion that the 16-bit char type contains UCS-2 data
> and that there are no surrogate planes. Their definition of "character"
> now de-facto leads to inefficient access, because they must not count
> the second surrogate character in UTF-16 or the continuation bytes in
> UTF-8 as characters.
> 
> Whenever you are dealing with a variable-length encoding of characters,
> you really don't want to specify anything in terms of a number of
> characters.

So you should not use a variable-length encoding for any serious generic
string processing like in this DOM example? The DOM spec actually
*requires* UTF-16 (but I have not been successful in getting anyone on
the W3C's DOM list to explain to me why exactly the character encoding
needs to be specified at all).
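
To make the cost Markus describes concrete, here is a minimal sketch of
what mapping a DOM-style character offset onto a UTF-8 buffer involves.
The name utf8_byte_offset is made up for illustration and is not part of
any DOM library; it assumes valid, NUL-terminated UTF-8 and counts code
points, not combining sequences.

```c
#include <stddef.h>

/* Hypothetical helper: return the byte offset of the code point at
 * index char_offset in a NUL-terminated UTF-8 string. If the string
 * has fewer code points, return its length in bytes.
 * Assumption: s is valid UTF-8. */
static size_t utf8_byte_offset(const char *s, size_t char_offset)
{
    size_t i = 0;      /* current byte index */
    size_t seen = 0;   /* code points passed so far */

    while (s[i] != '\0') {
        /* Bytes of the form 10xxxxxx are continuation bytes; any
         * other byte starts a new code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == char_offset)
                return i;
            seen++;
        }
        i++;
    }
    return i;
}
```

Every call walks the buffer from the start, so a substringData() built
on this is linear in the offset unless byte positions are cached
somewhere.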

> If in C you use DOM_CharacterData = wchar_t with UCS-4, then you stay
> easily out of this, because then the offset and count parameters are
> really just indices into wchar_t arrays.
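
For comparison, a minimal sketch of the wchar_t approach, assuming a
32-bit wchar_t that holds one UCS-4 code point per element (true with
glibc, not on every platform). wcs_substring is a hypothetical name,
not a function from my DOM code; the clamping mirrors how the DOM spec
describes substringData when offset plus count runs past the end.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: copy up to count wide characters starting at
 * offset into a newly allocated string. The caller frees the result.
 * Returns NULL if offset is past the end or allocation fails. */
static wchar_t *wcs_substring(const wchar_t *s, size_t offset, size_t count)
{
    size_t len = wcslen(s);
    wchar_t *out;

    if (offset > len)
        return NULL;
    if (count > len - offset)
        count = len - offset;   /* clamp to the end of the string */

    out = malloc((count + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    /* The offset is a plain array index; no scanning is needed. */
    wmemcpy(out, s + offset, count);
    out[count] = L'\0';
    return out;
}
```

The wcslen() call is still a scan, so a real implementation would
presumably store the length alongside the data.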

But then you have to deal with combining characters and CJK separately. My
project uses this DOM "tree" as a Model in an MVC viewer where the main
View is the terminal display. See:

  http://users.erols.com/mballen/tmvc/tmvc-plain.jpeg
  http://users.erols.com/mballen/tmvc/tmvc-resized.jpeg

This creates a little bit of a paradox. I can't help but think moving
to wchar_t will not be more efficient because of the memory consumption
and trampling on the CPU cache etc.
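
A rough way to put numbers on the memory side, as a sketch only: the
helper below (wide_bytes_for_utf8, a made-up name) reports how many
bytes the same text would need as a wchar_t array, to be compared with
strlen() on the UTF-8 original. It assumes valid UTF-8 and one wchar_t
per code point, and ignores the terminator.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: bytes needed to store the text of the UTF-8
 * string s as an array of wchar_t, excluding the terminator.
 * Assumptions: s is valid UTF-8 and each code point fits in one
 * wchar_t (i.e. wchar_t is 32 bits wide). */
static size_t wide_bytes_for_utf8(const char *s)
{
    size_t code_points = 0;

    for (; *s != '\0'; s++) {
        /* Count every byte that is not a 10xxxxxx continuation byte. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            code_points++;
    }
    return code_points * sizeof(wchar_t);
}
```

With a 4-byte wchar_t, plain ASCII text grows by a factor of four,
while text made of 3-byte UTF-8 sequences grows by only a third.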

Mike

-- 
May The Source be with you.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
