On Wed, 03 Apr 2002 14:18:53 +0100 Markus Kuhn <[EMAIL PROTECTED]> wrote:
> Michael B. Allen wrote on 2002-04-03 08:21 UTC:
> > > I'm assuming you don't have a specific application in mind, since you
> > > didn't answer Markus's question.
> >
> > Ok, here's an example. The Document Object Model W3C spec describes some
> > 'CharacterData' methods:
> >
> > http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html#ID-FF21A306
> >
> > My C implementation of this spec has functions for these methods like:
> >
> > DOM_String *DOM_CharacterData_substringData(DOM_CharacterData *data, int offset, int count);
> > void DOM_CharacterData_deleteData(DOM_CharacterData *data, int offset, int count);
> >
> > These offset and count parameters are described like 'The number of
> > characters to extract' or 'The character offset at which to insert'
> > etc. The DOM API is one of these XML peripherals and so the 'Char'
> > type is ultimately defined in the XML spec here:
> >
> > http://www.w3.org/TR/REC-xml#charsets
> >
> > Which at one point has an actual "definition":
> >
> > [Definition: A character is an atomic unit of text as specified by
> > ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).
>
> What they probably originally wanted it to mean is "an integer number
> that can be easily converted into an address in a linear data structure
> that represents coded text". That would really mean "byte" (for UTF-8)
> or "16-bit word" (for UTF-16). Everything else would be inefficient, as
> you would have to parse the entire text each time before you can access
> the start of the addressed substring.
>
> But they probably didn't realize this at the time of writing, and used
> the original Java notion that the 16-bit char type contains UCS-2 data
> and that there are no surrogate planes. Their definition of "character"
> now de-facto leads to inefficient access, because they must not count
> the second surrogate character in UTF-16 or the continuation bytes in
> UTF-8 as characters.
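
To make the cost described above concrete, here is a minimal sketch of what
a character offset forces on UTF-8 storage: a linear scan that skips
continuation bytes. The function name utf8_char_to_byte_offset is
hypothetical (it is not part of the DOM implementation discussed here), and
the sketch assumes the input is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: return the byte offset of
 * the character at index char_offset in the NUL-terminated UTF-8 string s,
 * or (size_t)-1 if s contains fewer than char_offset characters.
 * Assumes s is valid UTF-8.
 */
static size_t utf8_char_to_byte_offset(const char *s, size_t char_offset)
{
    size_t i = 0;      /* current byte index */
    size_t chars = 0;  /* characters skipped so far */

    while (chars < char_offset) {
        if (s[i] == '\0')
            return (size_t)-1;  /* ran off the end of the string */
        i++;                    /* step over the lead byte */
        /* Continuation bytes look like 10xxxxxx and are not counted. */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            i++;
        chars++;
    }
    return i;
}
```

A substringData(data, offset, count) built on this would need one scan to
find the start and a second to find the end, so every call is linear in the
text length rather than constant time. The same applies to UTF-16 once
surrogate pairs have to be counted as one character.
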
>
> Whenever you are dealing with a variable-length encoding of characters,
> you really don't want to specify anything in terms of a number of
> characters.

So you should not use a variable-length encoding for any serious generic
string processing like in this DOM example? The DOM spec actually
*requires* UTF-16 (but I have not been successful in getting anyone on
the W3C's DOM list to explain to me why exactly the character encoding
needs to be specified at all).

> If in C you use DOM_CharacterData = wchar_t with UCS-4, then you stay
> easily out of this, because then the offset and count parameters are
> really just indices into wchar_t arrays.

But then you have to deal with combining characters and CJK separately. My
project uses this DOM "tree" as a Model in an MVC viewer where the main
View is the terminal display. See:

  http://users.erols.com/mballen/tmvc/tmvc-plain.jpeg
  http://users.erols.com/mballen/tmvc/tmvc-resized.jpeg

This creates a little bit of a paradox. I can't help but think moving to
wchar_t will not be any more efficient, because of the memory consumption
and trampling on the CPU cache etc.

Mike

--
May The Source be with you.
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
